Abstract

Gaussian kernel regularization is widely used in the machine learning literature and has proven successful in many empirical experiments. The periodic version of Gaussian kernel regularization has been shown to be minimax rate optimal for estimating functions in any finite order Sobolev space. However, for a data set with $n$ points, the computational complexity of the Gaussian kernel regularization method is of order $O(n^3)$.

In this paper we propose to use binning to reduce the computation of Gaussian kernel regularization in both regression and classification. For periodic Gaussian kernel regression, we show that the binned estimator achieves the same minimax rates as the unbinned estimator, but the computation is reduced to $O(m^3)$, with $m$ the number of bins. To achieve the minimax rate in the $k$-th order Sobolev space, $m$ needs to be of order $O(kn^{1/(2k+1)})$, which makes the binned estimator computation of order $O(n)$ for $k = 1$ and even less for larger $k$. Our simulations show that the binned estimator (binning 120 data points into 20 bins in our simulation) provides almost the same accuracy with only 0.4% of the computation time.

For classification, binning with the L2-loss Gaussian kernel regularization and the Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem. With basically the same computation time, the L2-loss Gaussian kernel regularization on 966 bins achieves a better test classification rate (79.22%) than that (71.40%) on 966 randomly sampled data points. Using the OSU-SVM Matlab package, the SVM trained on 966 bins has a test classification rate comparable to that of the SVM trained on 27,179 samples, but reduces the training time from 5.99 hours to 2.56 minutes. The SVM trained on 966 randomly selected samples has a similar training time to, and a slightly worse test classification rate than, the SVM on 966 bins, but has 67% more support vectors and so takes 67% longer to predict on a new data point. The SVM trained on 512 cluster centers from the k-means algorithm reports almost the same test classification rate and a similar number of support vectors as the SVM on 512 bins, but the k-means clustering itself takes 375 times more computation time than binning.

KEY WORDS: Asymptotic minimax risk, Binning, Gaussian reproducing kernel, Regularization, Rate of convergence, Sobolev space, Support Vector Machines.


1  Introduction 

The method of regularization has been widely used in nonparametric function estimation. The problem begins with estimating a function $f$ using data $(x_i, y_i)$, $i = 1, \dots, n$, from a nonparametric regression model:

$$y_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n, \tag{1}$$

where $x_i \in R^d$, $i = 1, \dots, n$, are regression inputs or predictors, the $y_i$'s are the responses, and the $\epsilon_i$'s are i.i.d. $N(0, \sigma^2)$ Gaussian noise. The method of regularization takes the form of finding $f \in \mathcal{F}$ that minimizes

$$L(f, \mathrm{data}) + \lambda J(f), \tag{2}$$

where $L$ is an empirical loss, often taken to be the negative log-likelihood, and $J(f)$ is the penalty functional, usually a quadratic functional corresponding to a norm or semi-norm of a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}$. The regularization parameter $\lambda$ trades off the empirical loss against the penalty $J(f)$. In the regression case we may take $L(f, \mathrm{data}) = \sum_{i=1}^{n} (y_i - f(x_i))^2$, and the penalty functional $J(f)$ usually measures smoothness.

In the nonparametric statistics literature, the well-known smoothing spline (cf. Wahba 1990) is an example of a regularization method. The reproducing kernel Hilbert space used in smoothing splines is a Hilbert Sobolev space, and the penalty $J(f) = \int [f^{(m)}(x)]^2 dx$ is the norm or semi-norm in that space. The reproducing kernel of the Hilbert Sobolev space is covered in Wahba (1990), and the commonly used cubic spline corresponds to the case $m = 2$.

In the machine learning literature, Support Vector Machines (SVM) and regularization networks, which are both regularization methods, have been used successfully in many practical applications. Smola et al (1998), Wahba (1999), and Evgeniou et al (2000) make the connection between both methods and the method of regularization in a Reproducing Kernel Hilbert Space (RKHS). SVM uses the hinge loss function $L(f, \mathrm{data}) = \sum_{i=1}^{n} (1 - y_i f(x_i))_+$ in (2), with labels $y_i$ coded as $\{-1, 1\}$ in the two-class case. The penalty functional $J(f)$ used in SVM is the norm of the RKHS (see Vapnik 1995 and Wahba et al 1999 for details).

One particularly popular reproducing kernel used in the machine learning literature is the Gaussian kernel, which is defined as $G(s, t) = (2\pi)^{-1/2} \omega^{-1} \exp(-(s - t)^2 / (2\omega^2))$. Girosi et al (1993) and Smola et al (1998) showed that the Gaussian kernel corresponds to the penalty functional

$$J_g(f) = \sum_{m=0}^{\infty} \frac{\omega^{2m}}{2^m m!} \int_{-\infty}^{\infty} [f^{(m)}(x)]^2 dx. \tag{3}$$

Smola et al (1998) also introduced the periodic Gaussian reproducing kernel for estimating $2\pi$-periodic functions on $(-\pi, \pi]$ as the kernel corresponding to the penalty functional

$$J_{pg}(f) = \sum_{m=0}^{\infty} \frac{\omega^{2m}}{2^m m!} \int_{-\pi}^{\pi} [f^{(m)}(x)]^2 dx. \tag{4}$$

Using the equivalence between nonparametric regression and the Gaussian white noise model shown in Brown and Low (1996), Lin and Brown (2004) established asymptotic properties of regularization with a periodic Gaussian kernel. The periodic Gaussian kernel regularization is rate optimal in estimating functions in all finite order Sobolev spaces. It is also asymptotically minimax for estimating functions in the infinite order Sobolev space and the space of analytic functions. These asymptotic results on the periodic Gaussian kernel give a partial explanation of the success of the Gaussian reproducing kernel in practice. In section 2, we describe the periodic Gaussian kernel regularization in the nonparametric regression setup and review the asymptotic results, which will be compared to the binning results in section 4. Despite its good statistical properties, the Gaussian kernel regularization method is computationally very expensive, usually of order $O(n^3)$ on $n$ data points. It is computationally infeasible when $n$ is too large.

In this paper, motivated by the application of the binning technique in nonparametric regression (cf. Hall et al 1998), we study the effect of binning in the periodic Gaussian kernel regularization. We first derive the eigen structure of the periodic Gaussian kernel in the finite sample case; the eigen structure is then used to prove the asymptotic minimax rates of the binned periodic Gaussian kernel regularization estimator. The results on the kernel matrix are given in section 3.

In section 4, we show that the binned estimator achieves the same minimax rates as the unbinned estimator, while the computation is reduced to $O(m^3)$, with $m$ the number of bins. To achieve the minimax rate in the $k$-th order Sobolev space, $m$ needs to be of order $O(kn^{1/(2k+1)})$, which makes the binned estimator computation $O(n)$ for $k = 1$ and even less for larger $k$. For estimating functions in the Sobolev space of infinite order, the number of bins $m$ only needs to be of order $O(\sqrt{\log n})$ to achieve the minimax risk. For simple average binning, the optimal regularization parameter $\lambda_B$ for binned data has a simple relationship with the optimal $\lambda$ for the unbinned data, $\lambda_B \sim m\lambda/n$, and $\omega$ stays the same. In practice, choosing the parameters $(\lambda_B, \omega)$ by Mallows' $C_p$ statistic achieves the asymptotic rates.

In section 5, experiments are carried out to assess the accuracy and the computation reduction of the binning scheme in regression and classification problems. We first run simulations to test binning periodic Gaussian kernel regularization in the nonparametric regression setup. Four periodic functions with different orders of smoothness are used in the simulation. Compared with the unbinned estimators on 120 data points, the binned estimators (6 data points in each bin) provide the same accuracy but require only 0.4% of the computation.

For classification, binning on the L2-loss Gaussian kernel regularization and the Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem. With the same computation time, the L2-loss Gaussian kernel regularization on 966 bins achieves better accuracy (79.22%) than that (71.40%) on 966 randomly sampled data points. Using the OSU-SVM Matlab package, the SVM trained on 966 bins has a test classification rate comparable to that of the SVM trained on 27,179 samples, and reduces the training time from 5.99 hours to 2.56 minutes. The SVM trained on 966 randomly selected samples has a similar training time to, and a slightly worse test classification rate than, the SVM on 966 bins, but has 67% more support vectors and so takes 67% longer to predict on a new data point.

Compared with k-means clustering, another possible SVM training sample-size reduction scheme proposed in Feng and Mangasarian (2001), binning is much faster. The SVM trained on 512 cluster centers from the k-means algorithm reports almost the same test classification rate and a similar number of support vectors as the SVM on 512 bins, but the k-means clustering itself takes 375 times more computation time than binning. Therefore, for both regression and classification in practice, binning Gaussian kernel regularization reduces the computation while keeping the estimation or classification accuracy. Section 6 contains summaries and discussions.


2  Periodic  Gaussian  Kernel  Regularization 

Lin and Brown (2004) studied the asymptotic properties of the periodic Gaussian kernel regularization in estimating $2\pi$-periodic functions on $(-\pi, \pi]$ in three different function spaces. Using the asymptotic equivalence between nonparametric regression and the Gaussian white noise model (see Brown and Low 1996), the asymptotic properties of the periodic Gaussian kernel regularization are proved in the Gaussian white noise model. In this section, we introduce the periodic Gaussian regularization and review the asymptotic results of Lin and Brown (2004) in the nonparametric regression setting.

2.1 Nonparametric Regression


In this paper, we consider estimating a periodic function on $(0, 1]$ using periodic Gaussian regularization. With data $(x_i, y_i)$, $i = 1, \dots, n$, observed from model (1) at equally spaced design points $x_i$, the method of periodic Gaussian kernel regularization with L2 loss estimates $f$ by a periodic function $\hat{f}_\lambda$ that minimizes

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda J_{pg}(f), \tag{5}$$

where $J_{pg}(f)$ is the norm of the corresponding RKHS $\mathcal{F}_K$ of the periodic Gaussian kernel (Smola et al, 1998)

$$K(s, t) = 2 \sum_{l=0}^{\infty} \exp(-l^2\omega^2/2) \cos(2\pi l (s - t)). \tag{6}$$

The theory of reproducing kernel Hilbert spaces guarantees that the solution to (5) over $\mathcal{F}_K$ falls in the finite dimensional space spanned by $\{K(x_i, \cdot),\ i = 1, \dots, n\}$ (see Wahba 1990 for an introduction to the theory of reproducing kernels). Therefore, we can write the solution to (5) as $\hat{f}(x) = \sum_{i=1}^{n} \hat{c}_i K(x_i, x)$, and the minimization problem can then be solved in this finite dimensional space. Using the finite expression, (5) becomes

$$\min_{c}\, [(y - G^{(n)}c)^T (y - G^{(n)}c) + \lambda c^T G^{(n)} c], \tag{7}$$

where $y = (y_1, \dots, y_n)^T$, $c = (c_1, \dots, c_n)^T$, and $G^{(n)}$ is the $n \times n$ matrix with entries $K(x_i, x_j)$. The solution is $\hat{c} = (G^{(n)} + \lambda I)^{-1} y$, with $I$ being the $n \times n$ identity matrix. The fitted values are $\hat{y} = G^{(n)}\hat{c} = G^{(n)}(G^{(n)} + \lambda I)^{-1} y = Sy$, which is a linear estimator. The asymptotic risk of this estimator is shown in the next section.
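The closed-form solution above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes NumPy, truncates the infinite series (6) at a hypothetical `n_terms` terms, and uses the design points $x_i = (i - 1/2)/n$; the function names are illustrative only.

```python
import numpy as np

def periodic_gaussian_kernel(s, t, omega, n_terms=60):
    """Truncated series (6): 2 * sum_l exp(-l^2 omega^2 / 2) * cos(2 pi l (s - t)).

    n_terms is an assumed truncation level; the neglected terms decay like
    exp(-n_terms^2 omega^2 / 2), so it should grow when omega is small.
    """
    diff = np.asarray(s, dtype=float)[:, None] - np.asarray(t, dtype=float)[None, :]
    K = np.zeros_like(diff)
    for l in range(n_terms):
        K += 2.0 * np.exp(-(l ** 2) * omega ** 2 / 2.0) * np.cos(2.0 * np.pi * l * diff)
    return K

def fit_periodic_gaussian(x, y, omega, lam):
    """Solve (7): c = (G + lam I)^{-1} y, fitted values y_hat = G c."""
    G = periodic_gaussian_kernel(x, x, omega)
    c = np.linalg.solve(G + lam * np.eye(len(x)), y)  # O(n^3) linear solve
    return c, G @ c

# Usage on simulated data from model (1) with f(x) = sin(2 pi x).
n = 120
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(0)
f_true = np.sin(2.0 * np.pi * x)
y = f_true + 0.1 * rng.standard_normal(n)
c_hat, y_hat = fit_periodic_gaussian(x, y, omega=0.5, lam=0.1)
```

The linear solve on the $n \times n$ system is the step whose $O(n^3)$ cost motivates binning.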

2.2  Asymptotic  Properties 

We briefly review the asymptotic results shown in Lin and Brown (2004) in this section and compare them to the binned estimators in section 4. The asymptotic risk of periodic Gaussian regularization is studied for estimating periodic functions from three types of function spaces: Sobolev ellipsoids of finite order, ellipsoid spaces of analytic functions, and Sobolev spaces of infinite order, which are defined as follows. (Instead of working with $2\pi$-periodic functions on $(-\pi, \pi]$, we study periodic functions on $(0, 1]$ in this paper.)

The first type of function space, the $k$-th order Sobolev ellipsoid $H^k(Q)$, is defined as

$$H^k(Q) = \Big\{ f \in L_2(0, 1) : f \text{ is periodic},\ \int_0^1 [f(t)]^2 + [f^{(k)}(t)]^2 \, dt \le Q \Big\}. \tag{8}$$

It has an alternative definition in the Fourier space as:

$$H^k(Q) = \Big\{ f : f(t) = \sum_{l=0}^{\infty} \theta_l \phi_l(t),\ \sum_{l=0}^{\infty} \gamma_l \theta_l^2 \le Q,\ \gamma_0 = 1,\ \gamma_{2l-1} = \gamma_{2l} = l^{2k} + 1 \Big\}, \tag{9}$$

where $\phi_0(t) = 1$, $\phi_{2l-1}(t) = \sqrt{2}\sin(2\pi l t)$, $\phi_{2l}(t) = \sqrt{2}\cos(2\pi l t)$ are the classical trigonometric basis in $L_2(0, 1)$ and $\theta_l = \int_0^1 f(t)\phi_l(t)\,dt$ is the corresponding Fourier coefficient.

The second function space being considered is the ellipsoid space of analytic functions $A_\alpha(Q)$, which corresponds to (9) with the exponentially increasing sequence $\gamma_l = \exp(\alpha l)$. The third function space is the infinite order Sobolev space $H_\omega^\infty(Q)$, which corresponds to (9) with the sequence $\gamma_0 = 1$ and $\gamma_{2l-1} = \gamma_{2l} = e^{l^2\omega^2/2}$. Note that the penalty functional $J_{pg}$ of periodic Gaussian kernel regularization is the norm of $H_\omega^\infty(Q)$.

The asymptotic risk of $\hat{y}$ is determined by the trade-off between the variance and the bias. The asymptotic variance of $\hat{y} = G^{(n)}\hat{c} = G^{(n)}(G^{(n)} + \lambda I)^{-1} y$ depends only on $\lambda$ and $\omega$, where $G^{(n)}$ denotes the matrix $K(x_i, x_j)$. Meanwhile, the asymptotic bias depends not only on $\lambda$ and $\omega$, but also on the function $f$ itself. Lin and Brown (2004) proved the following lemma using the equivalence between nonparametric regression and the Gaussian white noise model (Brown and Low 1996).

Lemma 1 (Lin and Brown 2004) In nonparametric regression, the solution $\hat{y}$ to the periodic Gaussian kernel regularization problem (7) has an asymptotic variance:

$$\frac{1}{n}\sum_{i} \mathrm{var}(\hat{y}_i) = (1/n) \sum_{l} (1 + \lambda\beta_l)^{-2} \sim 2\sqrt{2}\,\omega^{-1} n^{-1} (-\log\lambda)^{1/2}, \tag{10}$$

for $\beta_l = \exp(l^2\omega^2/2)$ as $\lambda$ goes to zero. The asymptotic bias is:

$$\frac{1}{n}\sum_{i} \mathrm{bias}^2(\hat{y}_i) \sim \sum_{l} \lambda^2\beta_l^2 (1 + \lambda\beta_l)^{-2}\theta_l^2, \tag{11}$$

when estimating the function $f(t) = \sum_{l=0}^{\infty} \theta_l\phi_l(t)$.

Based on (10) and (11), the following asymptotic results about the periodic Gaussian kernel regularization are shown.

Lemma 2 (Lin and Brown 2004) For estimating functions in the $k$-th order Sobolev space $H^k(Q)$, the periodic Gaussian kernel regularization has a minimax risk:

$$(2k + 1)\, k^{-2k/(2k+1)} Q^{1/(2k+1)} n^{-2k/(2k+1)},$$

achieved when $\log(n/\lambda)/\omega^2 \sim (knQ)^{2/(2k+1)}/2$. The minimax rate for estimating functions in $A_\alpha(Q)$ is

$$2 n^{-1}\alpha^{-1}(\log n),$$

and the rate is

$$2\sqrt{2}\,\omega^{-1} n^{-1} (\log n)^{1/2}$$

for estimating functions in $H_\omega^\infty(Q)$.

It is well known that the asymptotic minimax risk over $H^k(Q)$ is $[2k/(k+1)]^{2k/(2k+1)} (2k+1)^{1/(2k+1)} Q^{1/(2k+1)} n^{-2k/(2k+1)}$. If we calculate the efficiency of the periodic Gaussian kernel regularization in terms of the sample sizes needed to achieve the same risk, the efficiency goes to one as the function gets smoother. Therefore, the estimator is rate optimal in this case. For estimating functions in $A_\alpha(Q)$ and $H_\omega^\infty(Q)$, the periodic Gaussian kernel regularization achieves the minimax risk (see Johnstone 1998 for the proof of the minimax risk in $A_\alpha(Q)$). These asymptotic rates in Lemma 2 are compared with the binning results in section 4.
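The efficiency claim can be made concrete with a short calculation. The sketch below takes the two constants exactly as stated in the text (both share the factor $Q^{1/(2k+1)} n^{-2k/(2k+1)}$) and computes the ratio of sample sizes at which the two risks coincide; the function names are illustrative, and the closed form $2k^2/((2k+1)(k+1))$ follows from simple algebra on those two constants.

```python
def periodic_gaussian_constant(k):
    """Leading constant of the periodic Gaussian risk in Lemma 2."""
    return (2 * k + 1) * k ** (-2 * k / (2 * k + 1))

def minimax_constant(k):
    """Leading constant of the asymptotic minimax risk over H^k(Q), as stated in the text."""
    a = 2 * k / (2 * k + 1)
    return (2 * k / (k + 1)) ** a * (2 * k + 1) ** (1 / (2 * k + 1))

def sample_size_efficiency(k):
    """n_minimax / n_pg such that both risks are equal.

    Both risks scale as C * n^(-a) with a = 2k/(2k+1), so equating them
    gives n_minimax / n_pg = (C_minimax / C_pg)^(1/a).
    """
    a = 2 * k / (2 * k + 1)
    return (minimax_constant(k) / periodic_gaussian_constant(k)) ** (1 / a)

efficiencies = [sample_size_efficiency(k) for k in (1, 2, 5, 20, 1000)]
```

Under these constants the efficiency equals $2k^2/((2k+1)(k+1))$, which increases toward one as $k$ grows, in line with the statement above.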


3  The  Eigen  Structure  of  the  Projection  Matrix 


Instead of working with the Gaussian white noise model, we directly prove Lin and Brown's asymptotic results in the nonparametric regression model in this section. Although the results stated in section 2.2 are proved more easily in the Gaussian white noise model than in the regression model, knowing the eigen structure of the projection matrix $S$ (defined by $\hat{y} = G^{(n)}(G^{(n)} + \lambda I)^{-1} y = Sy$ in section 2.1) helps us understand the binned estimators in section 4. To study the variance-bias trade-off of the periodic Gaussian regularization, we first derive the eigen-values and eigen-vectors of $G^{(n)} = K(x_i, x_j)$ and connect them to the functional eigen-values and eigen-functions of the reproducing kernel $K(\cdot, \cdot)$.
For a general reproducing kernel $R(\cdot, \cdot)$ that satisfies $\int\!\!\int R^2(x, y)\,dx\,dy < \infty$, there exist an orthonormal sequence of eigen-functions $\phi_1, \phi_2, \dots$ and eigen-values $\rho_1 \ge \rho_2 \ge \dots \ge 0$, with

$$\int R(s, t)\phi_l(s)\,ds = \rho_l\phi_l(t), \quad l = 1, 2, \dots \tag{12}$$

and $R(s, t) = \sum_{l} \rho_l\phi_l(s)\phi_l(t)$. When equally spaced points $x_1, \dots, x_n$ are taken in $(a, b]$, we get a Gram matrix $R^{(n)}_{ij} = R(x_i, x_j)$. The eigen-vectors and eigen-values of $R^{(n)}$ are defined as a sequence of orthonormal $n$ by 1 vectors $V_1^{(n)}, \dots, V_n^{(n)}$ and values $d_1^{(n)} \ge \dots \ge d_n^{(n)} \ge 0$ that satisfy

$$R^{(n)} V_l^{(n)} = d_l^{(n)} V_l^{(n)}, \quad l = 1, 2, \dots, n \tag{13}$$

and $R^{(n)} = \sum_{l=1}^{n} d_l^{(n)} V_l^{(n)} V_l^{(n)T}$. The eigen-values $d_l^{(n)}$ have limits: $\lim_{n\to\infty} d_l^{(n)}(b - a)/n = \rho_l$ (cf. Williams and Seeger 2000).

On $(0, 1]$, the eigen-functions of the periodic Gaussian kernel $K$ are the classical trigonometric basis functions $\phi_0(t) = 1$, $\phi_{2l-1}(t) = \sqrt{2}\sin(2\pi l t)$, $\phi_{2l}(t) = \sqrt{2}\cos(2\pi l t)$, with the corresponding eigen-values $\rho_0 = 1$ and $\rho_{2l-1} = \rho_{2l} = \exp(-l^2\omega^2/2)$ (for notational simplicity, the labels of eigen-values and eigen-functions start from 0 instead of 1). It is straightforward to see the eigen-function decomposition when we rewrite $K(s, t)$ as:

$$\begin{aligned}
K(s, t) &= 2\sum_{l=0}^{\infty} \exp(-l^2\omega^2/2)\cos(2\pi l (s - t)) \\
&= \sum_{l=0}^{\infty} e^{-l^2\omega^2/2}\,[\sqrt{2}\sin(2\pi l s)\,\sqrt{2}\sin(2\pi l t) + \sqrt{2}\cos(2\pi l s)\,\sqrt{2}\cos(2\pi l t)] \\
&= \sum_{l=0}^{\infty} \rho_l\phi_l(s)\phi_l(t),
\end{aligned}$$

where the $\phi_l(t)$'s are orthonormal on $(0, 1]$. When $n$ equally spaced data points are taken over $(0, 1]$, such as $x_i = -\frac{1}{2n} + \frac{i}{n}$, the Gram matrix $G^{(n)} = K(x_i, x_j)$ has the following property:

Theorem 1 The Gram matrix $G^{(n)} = K(x_i, x_j)$ at equally spaced data points $x_1, \dots, x_n$ over $(0, 1]$ has eigen-vectors $V_0^{(n)}, V_1^{(n)}, \dots, V_{n-1}^{(n)}$ (indexed from 0 to $n - 1$):

$$V_0^{(n)} = \sqrt{1/n}\,(1, \dots, 1)^T = \sqrt{1/n}\,(\phi_0(x_1), \dots, \phi_0(x_n))^T,$$

$$V_l^{(n)} = \sqrt{2/n}\,(\sin(2\pi h x_1), \dots, \sin(2\pi h x_n))^T = \sqrt{1/n}\,(\phi_l(x_1), \dots, \phi_l(x_n))^T \quad \text{for odd } l,$$

$$V_l^{(n)} = \sqrt{2/n}\,(\cos(2\pi h x_1), \dots, \cos(2\pi h x_n))^T = \sqrt{1/n}\,(\phi_l(x_1), \dots, \phi_l(x_n))^T \quad \text{for even } l,$$

where $h = [(l + 1)/2]$, $l = 1, \dots, n - 1$, and $[a]$ stands for the integer part of $a$. Their corresponding eigen-values are:

$$d_0^{(n)} = 2n\sum_{k=0}^{\infty} (-1)^k\rho_{2kn},$$

$$d_l^{(n)} = n\Big\{\rho_l + \sum_{k=1}^{\infty} (-1)^k\,[\rho_{2(kn+h)} + (-1)^{l-2h}\rho_{2(kn-h)}]\Big\}.$$

The proof is given in the appendix. This theorem shows that the eigen-vector $V_l^{(n)}$ is exactly the evaluation of the eigen-function $\phi_l(\cdot)$ at $x_1, \dots, x_n$, scaled by $\sqrt{1/n}$. It is worth pointing out that this exact relationship between eigen-functions and eigen-vectors does not generally hold for other kernels and data distributions. For general kernels and data distributions, one can only prove that the eigen-vectors converge to the corresponding eigen-functions as the sample size goes to infinity. Therefore, the eigen-vectors cannot be written out explicitly in the finite sample case.
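The eigen-vector part of Theorem 1 is easy to check numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it assumes NumPy, truncates series (6) at a hypothetical `n_terms`, uses $x_i = (i - 1/2)/n$, and takes an odd $n$ so that every frequency $h$ stays below $n/2$. It verifies that each trigonometric vector $v$ satisfies $Gv = dv$ for some scalar $d$, without relying on the closed-form eigen-values.

```python
import numpy as np

def gram_matrix(n, omega, n_terms=200):
    """Design points x_i = (i - 1/2)/n and the Gram matrix of the truncated kernel (6)."""
    x = (np.arange(1, n + 1) - 0.5) / n
    diff = x[:, None] - x[None, :]
    G = np.zeros((n, n))
    for l in range(n_terms):
        G += 2.0 * np.exp(-(l ** 2) * omega ** 2 / 2.0) * np.cos(2.0 * np.pi * l * diff)
    return x, G

def trig_vector(l, x):
    """Candidate eigen-vector V_l from Theorem 1 (l = 0, ..., n-1)."""
    n = len(x)
    if l == 0:
        return np.full(n, np.sqrt(1.0 / n))
    h = (l + 1) // 2  # integer part of (l + 1) / 2
    if l % 2 == 1:
        return np.sqrt(2.0 / n) * np.sin(2.0 * np.pi * h * x)
    return np.sqrt(2.0 / n) * np.cos(2.0 * np.pi * h * x)

def eigen_residual(G, v):
    """Rayleigh quotient d = v'Gv / v'v and the residual norm ||Gv - d v||."""
    d = (v @ G @ v) / (v @ v)
    return d, np.linalg.norm(G @ v - d * v)

n, omega = 21, 0.5
x, G = gram_matrix(n, omega)
V = np.column_stack([trig_vector(l, x) for l in range(n)])
checks = [eigen_residual(G, V[:, l]) for l in range(n)]
```

For low frequencies the Rayleigh quotients divided by $n$ agree closely with $\exp(-h^2\omega^2/2)$, which is the finite-sample counterpart of the limit $d_l^{(n)}/n \to \rho_l$ for $l \ge 1$.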


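With the eigen decomposition of $G^{(n)}$, we now study the variance-bias trade-off of the periodic Gaussian kernel regularization. Using matrix notation, let $V^{(n)} = (V_0^{(n)}, \dots, V_{n-1}^{(n)})$ and let $D^{(n)} = \mathrm{diag}(d_0^{(n)}, \dots, d_{n-1}^{(n)})$ be an $n$ by $n$ diagonal matrix; then $G^{(n)} = V^{(n)} D^{(n)} V^{(n)T}$.

For the variance term, recalling that $S = G^{(n)}(G^{(n)} + \lambda I)^{-1}$, we have $S = V^{(n)}\,\mathrm{diag}\Big(\frac{d_l^{(n)}}{d_l^{(n)} + \lambda}\Big) V^{(n)T}$. Therefore, the variance term is:

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i) = \frac{1}{n}\,\mathrm{trace}(S^T S) = \frac{1}{n}\sum_{l=0}^{n-1}\Big(\frac{d_l^{(n)}}{d_l^{(n)} + \lambda}\Big)^2 = \frac{1}{n}\sum_{l=0}^{n-1}\Big(\frac{d_l^{(n)}/n}{d_l^{(n)}/n + \lambda/n}\Big)^2.$$

Since $\lim_{n\to\infty} d_l^{(n)}/n = \rho_l$ for $l \ge 0$ and $\rho_l = 1/\beta_l$, we get

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i) \sim \frac{1}{n}\sum_{l}\Big(\frac{\rho_l}{\rho_l + (\lambda/n)}\Big)^2 = \frac{1}{n}\sum_{l}\Big(1 + \beta_l\frac{\lambda}{n}\Big)^{-2},$$

which is the same as the asymptotic variance shown in (10).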

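For the bias term, we expand $f(t)$ as $f(t) = \sum_{l=0}^{\infty}\theta_l\phi_l(t)$. Using the relationship between $V^{(n)}$ and $\phi(\cdot)$ in Theorem 1, we can write the vector $F = (f(x_1), \dots, f(x_n))^T$ as

$$F = \sum_{l=0}^{n-1}\Theta_l^{(n)} V_l^{(n)} = V^{(n)}\Theta^{(n)},$$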


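where

$$\Theta_0^{(n)} = \sqrt{n}\,\Big\{\theta_0 + \sqrt{2}\sum_{k=1}^{\infty} (-1)^k\theta_{2kn}\Big\},$$

$$\Theta_l^{(n)} = \sqrt{n}\,\Big\{\theta_l + \sum_{k=1}^{\infty} (-1)^k\,[\theta_{2kn+l} + (-1)^{l-2h}\theta_{2kn+l-4h}]\Big\},$$

for $1 \le l \le n - 1$ and $h = [(l + 1)/2]$. Thus, the bias term is written as: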

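$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \mathrm{bias}^2(\hat{y}_i) &= \frac{1}{n}\,((S - I)F)^T ((S - I)F) \\
&= \frac{1}{n}\,\Big(V^{(n)}\mathrm{diag}\Big(\frac{-\lambda}{d_l^{(n)} + \lambda}\Big)V^{(n)T}F\Big)^T \Big(V^{(n)}\mathrm{diag}\Big(\frac{-\lambda}{d_l^{(n)} + \lambda}\Big)V^{(n)T}F\Big) \\
&= \sum_{l=0}^{n-1}\Big(\frac{\Theta_l^{(n)}}{\sqrt{n}}\Big)^2\Big(\frac{\lambda/n}{d_l^{(n)}/n + \lambda/n}\Big)^2 \\
&\sim \sum_{l}\theta_l^2\Big(\frac{\lambda/n}{\rho_l + \lambda/n}\Big)^2 \\
&= \sum_{l}\theta_l^2\Big(\frac{\beta_l\lambda/n}{1 + \beta_l\lambda/n}\Big)^2,
\end{aligned}$$

since $\lim_{n\to\infty}\Theta_l^{(n)}/\sqrt{n} = \theta_l$, $\lim_{n\to\infty} d_l^{(n)}/n = \rho_l$, and $\rho_l = 1/\beta_l$.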
Both the variance and the bias are the same as in Lemma 1, which was derived using the Gaussian white noise model in Lin and Brown (2004). Although it is easier to prove Lemmas 1 and 2 in the Gaussian white noise model than through the eigen expansion derived above, binned estimators in the nonparametric regression setup do not directly convert to the Gaussian white noise model. Therefore, the eigen expansion is used to prove the asymptotic properties of binning the periodic Gaussian kernel regularization in the next section.


4  Binning  Periodic  Gaussian  Kernel  Regularization 

Although the periodic Gaussian regularization method has good asymptotic properties, the computation of the estimator $\hat{y} = G^{(n)}(G^{(n)} + \lambda I)^{-1} y$ is very expensive, taking $O(n^3)$ to invert the $n$ by $n$ matrix $G^{(n)} + \lambda I$. When the sample size gets large, the computation is not even feasible. In nonparametric regression estimation, Hall et al (1998) studied the binning technique. In this section, we use the explicit eigen structure of the periodic Gaussian kernel to study the effect of binning on the asymptotic properties of periodic Gaussian regularization.

4.1  Simple  Binning  Scheme 

Let us take $n$ equally spaced data points in $(0, 1]$, say $x_i = -\frac{1}{2n} + \frac{i}{n}$. Without loss of generality, we assume the number of design points $n$ equals $m \times p$, where $m$ is the number of bins and $p$ is the number of data points in each bin. Using the equally spaced binning scheme, let us denote the centers of the bins as $\bar{x}_j = (x_{(j-1)p+1} + \dots + x_{(j-1)p+p})/p$ and the average of the observations in each bin as $\bar{y}_j = (y_{(j-1)p+1} + \dots + y_{(j-1)p+p})/p$, for $j = 1, \dots, m$. When we apply the periodic Gaussian regularization to the binned data, the estimated function is of the form $\hat{f}(x) = \sum_{j=1}^{m}\hat{c}_j K(x, \bar{x}_j)$, where $\hat{c}$ is the solution of

$$\min_{c}\, (\bar{y} - G^{(m)}c)^T (\bar{y} - G^{(m)}c) + \lambda_B c^T G^{(m)} c, \tag{14}$$

with $G^{(m)}_{ij} = K(\bar{x}_i, \bar{x}_j)$, $\bar{y} = (\bar{y}_1, \dots, \bar{y}_m)^T$ and $\lambda_B$ as the regularization parameter. Similar to the estimator derived in section 2.1, the solution of (14) is $\hat{c} = (G^{(m)} + \lambda_B I)^{-1}\bar{y}$. With this explicit form of the binned estimator, we study its asymptotic properties next. Let

$$B^{(m,n)} = \begin{pmatrix}
m/n & \cdots & m/n & 0 & \cdots & \cdots & \cdots & \cdots & 0 \\
0 & \cdots & 0 & m/n & \cdots & m/n & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & \cdots & \cdots & \cdots & 0 & m/n & \cdots & m/n
\end{pmatrix} \tag{15}$$

be the $m$ by $n$ averaging matrix, so that $\bar{y} = B^{(m,n)} y$.


The binned estimator can be written as $\hat{y} = G^{(n,m)}(G^{(m)} + \lambda_B I)^{-1} B^{(m,n)} y = S_B y$, with $G^{(n,m)}_{ij} = K(x_i, \bar{x}_j)$ being an $n$ by $m$ matrix. From this expression, it is straightforward to see that the computation is reduced to $O(m^3)$, since the matrix inversion is now taken on an $m$ by $m$ matrix. The additional computation for binning the data itself is around $O(n)$, which is trivial compared with the $O(n^3)$ matrix inversion.
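The binned estimator can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes NumPy, a truncated version of series (6) with a hypothetical `n_terms`, inputs sorted so that each run of $p$ consecutive points forms one bin, and $n$ divisible by $m$; all function names are illustrative.

```python
import numpy as np

def periodic_gaussian_kernel(s, t, omega, n_terms=60):
    """Truncated series (6); n_terms is an assumed truncation level."""
    diff = np.asarray(s, dtype=float)[:, None] - np.asarray(t, dtype=float)[None, :]
    K = np.zeros_like(diff)
    for l in range(n_terms):
        K += 2.0 * np.exp(-(l ** 2) * omega ** 2 / 2.0) * np.cos(2.0 * np.pi * l * diff)
    return K

def bin_data(x, y, m):
    """Average consecutive groups of p = n/m points: O(n) work."""
    n = len(x)
    if n % m != 0:
        raise ValueError("this sketch assumes n is a multiple of m")
    p = n // m
    return x.reshape(m, p).mean(axis=1), y.reshape(m, p).mean(axis=1)

def fit_binned(x, y, m, omega, lam_b):
    """y_hat = G^(n,m) (G^(m) + lam_b I)^{-1} y_bar, solving only an m x m system."""
    x_bar, y_bar = bin_data(x, y, m)
    G_m = periodic_gaussian_kernel(x_bar, x_bar, omega)
    c = np.linalg.solve(G_m + lam_b * np.eye(m), y_bar)  # O(m^3)
    G_nm = periodic_gaussian_kernel(x, x_bar, omega)     # n x m
    return G_nm @ c

# Usage: 120 points binned into 20 bins, with lam_b = m * lam / n for lam = 0.1.
n, m = 120, 20
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(0)
f_true = np.sin(2.0 * np.pi * x)
y = f_true + 0.1 * rng.standard_normal(n)
y_hat_binned = fit_binned(x, y, m, omega=0.5, lam_b=m * 0.1 / n)
```

The only cubic-cost step is the solve on the $m \times m$ system; forming $G^{(n,m)}$ and the bin averages is comparatively cheap.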

Using this matrix expression, the variance of the estimator can be written as:

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i) = \frac{1}{n}\,\mathrm{trace}(S_B^T S_B) = \frac{1}{n}\,\mathrm{trace}(S_B S_B^T). \tag{16}$$

Based on the following proposition, the variance term can be explicitly written out using the eigen-decomposition of $S_B$.


Proposition 1 Suppose $n = mp$, $x_i = -\frac{1}{2n} + \frac{i}{n}$ and $\bar{x}_j = (x_{(j-1)p+1} + \dots + x_{(j-1)p+p})/p$. The eigen-vectors of $G^{(n)}$ and the eigen-vectors of $G^{(m)}$ satisfy


Qin^yim) 


for  k 


0, 1,  •  •  • ,  m. 
The proof is provided in the appendix. This proposition shows that the eigen-vectors of $G^{(m)}$ are projected to the corresponding eigen-vectors of $G^{(n)}$ by the matrix $G^{(n,m)}$ (unfortunately, this property does not hold for general kernels). Following this relationship, the asymptotic variance of the binned estimator is provided in the following theorem:


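Theorem 2 The asymptotic variance of the binned estimator $\hat{y} = G^{(n,m)}(G^{(m)} + \lambda_B I)^{-1} B^{(m,n)} y$ in the equally spaced binning scheme is:

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i) \sim 2\sqrt{2}\,\omega^{-1} n^{-1} (-\log(\lambda_B/m))^{1/2}, \tag{17}$$

as $m \to \infty$, $n \to \infty$ and $\lambda_B \to 0$. The expression is the same as the asymptotic variance of the original estimator when $\lambda_B = m\lambda/n$.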
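The matching rule $\lambda_B = m\lambda/n$ can be illustrated numerically by comparing the two average variances directly. The sketch below is an illustration under assumptions, not a substitute for the proof: it assumes NumPy, unit noise variance, a truncated version of series (6) with a hypothetical `n_terms`, and design points $x_i = (i - 1/2)/n$; function names are illustrative.

```python
import numpy as np

def kernel(s, t, omega, n_terms=60):
    """Truncated series (6); n_terms is an assumed truncation level."""
    diff = np.asarray(s, dtype=float)[:, None] - np.asarray(t, dtype=float)[None, :]
    K = np.zeros_like(diff)
    for l in range(n_terms):
        K += 2.0 * np.exp(-(l ** 2) * omega ** 2 / 2.0) * np.cos(2.0 * np.pi * l * diff)
    return K

def unbinned_variance(n, omega, lam):
    """(1/n) trace(S'S) with S = G (G + lam I)^{-1}, unit noise variance."""
    x = (np.arange(1, n + 1) - 0.5) / n
    G = kernel(x, x, omega)
    S = G @ np.linalg.inv(G + lam * np.eye(n))
    return np.trace(S.T @ S) / n

def binned_variance(n, m, omega, lam_b):
    """(1/n) trace(S_B S_B') with S_B = G^(n,m) (G^(m) + lam_b I)^{-1} B^(m,n)."""
    p = n // m
    x = (np.arange(1, n + 1) - 0.5) / n
    x_bar = x.reshape(m, p).mean(axis=1)
    B = np.kron(np.eye(m), np.full((1, p), 1.0 / p))  # averaging matrix (15)
    G_m = kernel(x_bar, x_bar, omega)
    G_nm = kernel(x, x_bar, omega)
    S_B = G_nm @ np.linalg.solve(G_m + lam_b * np.eye(m), B)
    return np.trace(S_B @ S_B.T) / n

n, m, omega, lam = 120, 20, 0.5, 0.1
v_unbinned = unbinned_variance(n, omega, lam)
v_binned = binned_variance(n, m, omega, lam_b=m * lam / n)
```

With these settings the two quantities should be close, because the eigen-values beyond the first $m$ contribute very little to the unbinned trace when $\omega$ is not too small.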

See the proof in the appendix. Now we focus on the bias term, which depends not only on the projection operation, but also on the smoothness of the underlying function $f$ itself. We have the following theorem for the bias term:

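Theorem 3 In the equally spaced binning scheme, as $m \to \infty$, $n \to \infty$, $m/n \to 0$ and $\lambda_B \to 0$, the bias of the binned estimator is:

$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{Bias}^2(\hat{y}_i) \sim \sum_{l=0}^{m-1}\theta_l^2\Big(\frac{\beta_l\lambda_B/m}{1 + \beta_l\lambda_B/m}\Big)^2 + \sum_{l=m}^{\infty}\theta_l^2, \tag{18}$$

when estimating the function $f(t) = \sum_{l=0}^{\infty}\theta_l\phi_l(t)$.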

The  theorem  is  proved  in  the  appendix.  With  the  asymptotic  variance  and  bias  obtained,  we 
show  in  the  next  section  that  the  asymptotic  minimax  rates  of  the  periodic  Gaussian  kernel 
regularization  are  kept  after  binning  the  data. 

4.2  Asymptotic  Rates  of  Binned  Estimators 

In this section, we study the asymptotic rates of the binned periodic Gaussian kernel regularization in estimating functions in the spaces defined in section 2.2. We start with the infinite order Sobolev space. As shown in the next theorem, the binned estimator achieves the same minimax rate as the original estimator in this space.

Theorem 4 The minimax rate of the binned estimator $\hat{y} = G^{(n,m)}(G^{(m)} + \lambda_B I)^{-1} B^{(m,n)} y$ for estimating functions in the infinite order Sobolev space $H_\omega^\infty(Q)$ is:

$$\min_{m,\lambda_B}\ \max_{\theta \in H_\omega^\infty(Q)} E\Big[\frac{1}{n}(\hat{y} - y)^T(\hat{y} - y)\Big] \sim 2\sqrt{2}\,\omega^{-1} n^{-1} (\log n)^{1/2},$$

which is the same rate as that of the unbinned estimator. This rate is achieved when $m/n \to 0$ and $m$ is large enough so that $\omega^2 m^2/2 \ge \log(4m/\lambda_B)$. The parameter $\lambda_B = \lambda_B(n, m)$ satisfies $\log(m/\lambda_B) \sim \log n$ and $\lambda_B/m = o(n^{-1}(\log n)^{1/2})$. This leads $m$ to be of order $O(\sqrt{\log n})$.


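Proof: As shown in Theorem 3, the bias of the binned estimator is:

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \mathrm{Bias}^2(\hat{y}_i) &\sim \sum_{l=0}^{m-1}\theta_l^2\Big(\frac{\beta_l\lambda_B/m}{1 + \beta_l\lambda_B/m}\Big)^2 + \sum_{l=m}^{\infty}\theta_l^2 \\
&\le \frac{\lambda_B}{4m}\sum_{l=0}^{m-1}\beta_l\theta_l^2 + \sum_{l=m}^{\infty}\theta_l^2 \\
&\le \frac{\lambda_B}{4m}\sum_{l=0}^{\infty}\beta_l\theta_l^2 \qquad \Big(\text{when } \frac{\lambda_B\beta_m}{4m} \ge 1\Big),
\end{aligned}$$

where the first inequality uses $(x/(1+x))^2 \le x/4$ for $x \ge 0$, and $\lambda_B\beta_m/(4m) \ge 1$ is satisfied as $\omega^2 m^2/2 \ge \log(4m/\lambda_B)$. Then the asymptotic risk is

$$\frac{1}{n}E[(\hat{y} - y)^T(\hat{y} - y)] \le \frac{1}{n}\sum_{l}\Big(1 + \beta_l\frac{\lambda_B}{m}\Big)^{-2} + \frac{\lambda_B}{4m}Q \sim 2\sqrt{2}\,\omega^{-1} n^{-1} (\log n)^{1/2},$$

when $\log(m/\lambda_B) \sim \log n$ and $\lambda_B/m = o(n^{-1}(\log n)^{1/2})$. □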


The theorem shows that the binned estimator achieves the same asymptotic rate as the original estimator when $m$ is of order $O(\sqrt{\log n})$. Therefore, the computational complexity of the binned estimator is around $O((\log n)^{3/2})$. In practice, we do not expect that $m$ can be as small as this order, since functions of this type are not realistic in common applications. Next we study the case of estimating functions in the Sobolev space $H^k(Q)$ with finite order $k$.

Theorem 5 The minimax rate of the binned estimator $\hat{y} = G^{(n,m)}(G^{(m)} + \lambda_B I)^{-1} B^{(m,n)} y$ for estimating functions in the $k$-th order Sobolev space $H^k(Q)$ is:

$$\min_{m,\omega,\lambda_B}\ \max_{\theta \in H^k(Q)} \frac{1}{n}E[(\hat{y} - y)^T(\hat{y} - y)] \sim (2k + 1)\, k^{-2k/(2k+1)} Q^{1/(2k+1)} n^{-2k/(2k+1)},$$

which is the same rate as that of the unbinned estimator. This rate is achieved when $m/n \to 0$ and $m$ is large enough so that $m \ge \sqrt{2}\,\omega^{-1}(-\log(\lambda_B/m))^{1/2}$. The parameter $\lambda_B = \lambda_B(n, m, \omega)$ satisfies $\log(m/\lambda_B)/\omega^2 \sim (knQ)^{2/(2k+1)}/2$. The condition leads $m$ to be of order $O(kn^{1/(2k+1)})$.


Proof:  We  first  study  the  bias  term,  let  Am  =  A s/m 


B(m,w,Xm) 


max 

d£Hk(Q) 


m— 1 


£«?( 


(diXm 

1  +  fllXm 


)2  +  £«i2 

l=m 


m— 1 


00 


max 

0£Hk(Q) 


+  A  ^m1)  Vz  V,*?)  +  Pi  \pltf) 

1=0  l=m 


Here  p2z- 1  =  p2z  =  1  +  l2k  are  the  coefficients  in  the  definition  (8)  of  Sobolev  ellipsoid 
Hk(Q).  The  maximum  is  achieved  by  putting  all  mass  Q  at  the  l  term  that  maximizes 
X”=Lo1(l  +  A'”lAm)“V1  +  XSmPz_1-  First  let  us  find  the  maximizer  of 


A\m(x)  =  [1  +  Xr^exp(—x2w2 /2)]  2(1  +  x2k)  1  over  x  >  0 


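As shown in Lin and Brown (2004), the maximizer $x_0$ satisfies $x_0^2\omega^2/2 \sim (-\log\lambda_m)$ and the
maximum is $A_{\lambda_m}(x_0) \sim x_0^{-2k} \sim 2^{-k}\omega^{2k}(-\log\lambda_m)^{-k}$. When $m > x_0$ and $m > \sqrt{2}\,\omega^{-1}(-\log\lambda_m)^{1/2}$,
we have $(1 + m^{2k})^{-1} < 2^{-k}\omega^{2k}(-\log\lambda_m)^{-k}$. Therefore, the maximum value of $B(m,\omega,\lambda_m)$ is
$\sim Q\,2^{-k}\omega^{2k}(-\log\lambda_m)^{-k}$. Thus we have the following:

$$\max_{\theta \in H^k(Q)} \frac{1}{n} E[(\hat{y} - y)^T(\hat{y} - y)] \sim Q\,2^{-k}\omega^{2k}(-\log\lambda_m)^{-k} + 2\sqrt{2}\,\omega^{-1} n^{-1} (-\log\lambda_m)^{1/2}.$$

The asymptotic rate $(2k+1)\,k^{-2k/(2k+1)} Q^{1/(2k+1)} n^{-2k/(2k+1)}$ is achieved when the parameters
satisfy $\log(m/\lambda_B)/\omega^2 \sim (knQ)^{2/(2k+1)}/2$ and $m > \sqrt{2}\,\omega^{-1}(-\log(\lambda_B/m))^{1/2}$. □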
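The last step of the proof can be checked numerically. The sketch below is an illustration rather than part of the original argument: it writes the two leading terms of the risk as a function of the single quantity u = log(m/lambda_B)/omega^2, and compares the closed-form minimizer and rate from the theorem against a brute-force grid search. The function names and the values of k, n and Q are assumptions chosen only for the example.

```python
import math

def risk(u, k, n, Q):
    """Leading-order risk as a function of u = log(m / lambda_B) / omega**2."""
    bias = Q * 2.0 ** (-k) * u ** (-k)                   # Q 2^{-k} w^{2k} (-log lambda_m)^{-k}
    variance = 2.0 * math.sqrt(2.0) * math.sqrt(u) / n   # 2 sqrt(2) w^{-1} n^{-1} (-log lambda_m)^{1/2}
    return bias + variance

def optimal_u(k, n, Q):
    """Closed-form condition from the theorem: u = (k n Q)^{2/(2k+1)} / 2."""
    return (k * n * Q) ** (2.0 / (2 * k + 1)) / 2.0

def minimax_rate(k, n, Q):
    """(2k+1) k^{-2k/(2k+1)} Q^{1/(2k+1)} n^{-2k/(2k+1)}."""
    e = 1.0 / (2 * k + 1)
    return (2 * k + 1) * k ** (-2 * k * e) * Q ** e * n ** (-2 * k * e)

k, n, Q = 2, 10_000, 1.0  # illustrative values only
u_star = optimal_u(k, n, Q)

# Brute-force search on a grid from 0.5 * u_star to 1.5 * u_star.
grid = [u_star * (0.5 + 0.001 * i) for i in range(1001)]
u_best = min(grid, key=lambda u: risk(u, k, n, Q))
```

Under these assumptions the grid minimizer sits next to `u_star`, and `risk(u_star, k, n, Q)` agrees with `minimax_rate(k, n, Q)` up to floating-point error.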

The theorem shows that the binned estimator achieves the same minimax rate as the original
estimator in the finite order Sobolev space. The same result also holds in the ellipsoid $A_a(Q)$
of analytic functions, but we do not prove it here. Comparing the order of the smallest m
needed to achieve the optimal rates for estimating functions with different orders of
smoothness, we find that smoother functions require a smaller number of bins. For instance,
the optimal rate of estimating a function in the k-th order Sobolev space can be achieved by
binning the data into $m = O(k n^{1/(2k+1)})$ bins. The number of bins m decreases
as k increases. Binning reduces the computation from $O(n^3)$ to $O(m^3) = O(n)$ for k = 1, to
$O(n^{3/5})$ for k = 2, and even less for larger k values.
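To make the orders above concrete, the following sketch evaluates the order of the number of bins and the exponent of n in the cost $O(m^3)$ for a few smoothness levels. It is only an order-of-magnitude illustration: the constants hidden in the O-notation are ignored, and the helper names and the sample size are assumptions made for the example.

```python
def bins_needed(n, k):
    """Order of the number of bins, k * n^(1/(2k+1)); constants are ignored."""
    return k * n ** (1.0 / (2 * k + 1))

def cost_exponent(k):
    """Exponent a such that m^3 grows like n^a when m grows like n^(1/(2k+1))."""
    return 3.0 / (2 * k + 1)

n = 10 ** 6  # illustrative sample size
summary = [(k, bins_needed(n, k), cost_exponent(k)) for k in (1, 2, 3)]
```

For this large n the entries of `summary` show fewer bins as k grows, and the cost exponent drops from 1 (k = 1) to 3/5 (k = 2) to 3/7 (k = 3).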

5  Experiments 

Simulations and real data experiments are conducted in this section to study the effect
of binning in regression and classification. We first use simulations to study binning in
estimating periodic functions in the nonparametric regression setup. The results show that
the accuracy of the binned estimators is no worse than that of the original estimators when
the functions are smooth enough. Meanwhile, the computation is reduced to 0.4% of that of the
original estimator when the original 120 data points are binned into 20 bins.

For classification, we test the binning idea on a polar cloud detection
problem (cf. Shi et al. 2004). The $L_2$ loss and hinge loss functions are both tested in this
experiment. In both cases, the binned classifiers provide results competitive with classifiers
trained on the full data. Furthermore, the computation time is significantly reduced by binning. As
an illustration, the time for training the SVM on 966 bins is 2.56 minutes, only 0.71% of the 5.99
hours used to train the SVM on 27179 samples, which provides slightly better accuracy
than the SVM on 966 bins.

Figure 1: Regression functions and data used in the simulations.

5.1  Non-parametric  Regression 

Data are simulated from the regression model (1) with N(0, 1) noise, using four periodic
functions on (0, 1] with different orders of smoothness:

$f_1(x) = \sin^2(2\pi x)\, 1_{(x \le 1/2)}$
$f_2(x) = -x + 2(x - 1/4)\, 1_{(x \ge 1/4)} + 2(-x + 3/4)\, 1_{(x \ge 3/4)}$
$f_3(x) = 1/(2 - \sin(2\pi x))$
$f_4(x) = 2 + \sin(2\pi x) + 2\cos(2\pi x) + 3\sin^2(2\pi x) + 4\cos^3(2\pi x) + 5\sin^3(2\pi x)$

The plots of the functions are given on the left of Figure 1 and the data are plotted on the
right. The first function has second-order smoothness. The second function has first-order
smoothness. The third function is infinitely smooth. The fourth function is even
smoother: it has a Fourier series that contains only finitely many terms. In our simulation,
the sample size n is set to 120 and the numbers of bins are m = 60, 40, 30, 24, 20, 15, 12,
with the corresponding numbers of data points in each bin being p = 2, 3, 4, 5, 6, 8, 10. All
simulations are done in Matlab 6.

Figure 2: Mean squared errors of the binned estimators vs. the number of data points in each
bin. Left: binned periodic Gaussian kernel regularization; Right: binned Gaussian kernel
regularization. In both plots, unbinned estimators are those with 1 data point in each bin.
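The simulation design of this subsection can be sketched in a few lines. The code below is an illustration, not the original Matlab code: it assumes the equally spaced design $x_i = i/n - 1/(2n)$ used in the Appendix, reads the indicator in $f_1$ as $x \le 1/2$, and bins by averaging each run of p consecutive responses. The names `simulate` and `bin_average` are hypothetical.

```python
import math
import random

def f1(x):
    return math.sin(2 * math.pi * x) ** 2 if x <= 0.5 else 0.0

def f2(x):
    return -x + 2 * (x - 0.25) * (x >= 0.25) + 2 * (-x + 0.75) * (x >= 0.75)

def f3(x):
    return 1.0 / (2.0 - math.sin(2 * math.pi * x))

def f4(x):
    s, c = math.sin(2 * math.pi * x), math.cos(2 * math.pi * x)
    return 2 + s + 2 * c + 3 * s ** 2 + 4 * c ** 3 + 5 * s ** 3

def simulate(f, n, rng):
    """Responses f(x_i) + N(0, 1) noise on the assumed grid x_i = i/n - 1/(2n)."""
    xs = [i / n - 1 / (2 * n) for i in range(1, n + 1)]
    ys = [f(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def bin_average(values, p):
    """Average each run of p consecutive values (requires p to divide len(values))."""
    if len(values) % p != 0:
        raise ValueError("p must divide the number of values")
    return [sum(values[i:i + p]) / p for i in range(0, len(values), p)]

rng = random.Random(0)
xs, ys = simulate(f1, 120, rng)
binned = {p: (bin_average(xs, p), bin_average(ys, p)) for p in (2, 3, 4, 5, 6, 8, 10)}
```

With p = 6 this gives m = 20 bin centers and 20 averaged responses, which then play the role of the training data for the binned estimator.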

The computation of the periodic Gaussian regularization is sketched as follows. We
follow Lin and Brown (2004) to approximate the periodic Gaussian kernel defined in (6). A
Gaussian kernel $G(s,t) = (2\pi)^{-1/2}\omega^{-1}\exp(-(s-t)^2/(2\omega^2))$ is used to approximate K(s,t). It
is shown in Williamson et al. (2001) that $K(s,t) = \sum_{k=-\infty}^{\infty} G((s-t-2k\pi)/2\pi)$. In fact,
$G_J(s,t) = \sum_{k=-J}^{J} G((s-t-2k\pi)/2\pi)$ with J = 1 is already a good approximation to K(s,t),
with

$$0 < K(s,t) - G_1(s,t) < 2.1 \times 10^{-20}, \quad \forall (s-t) \in (0,1] \text{ for } \omega < 1.$$

Therefore, we use $G_1(s,t)$ as an easily computable proxy of K(s,t) in the simulation.

Over the data generated from the regression model (1) on the four functions considered,
we compare the mean squared errors of the binned estimator and the original estimator. For
periodic Gaussian kernel regularization, we search over $\omega = 0.3k_1 - 0.1$ for $k_1 = 1, \ldots, 10$
and $\lambda = \exp(-0.4k_2 + 7)$ for $k_2 = 1, \ldots, 50$. Then we compute the binned estimator with
the number of data points in each bin p set to 2, 3, 4, 5, 6, 8, 10 separately. The parameters are
set to be $\omega$ and $\lambda_B = m\lambda/n$. In both cases, we use the minimizer of Mallows' $C_p$ to choose the
parameters $(\omega, \lambda_B)$.

The simulation runs 300 times. The left panel of Figure 2 shows the averaged mean
squared errors against the number of data points in each bin for the four functions (with
the unbinned estimators shown as those with one data point in each bin in the plot). In most
cases, the average errors of the binned estimators are not significantly higher than those of the
original estimators, while the computation is reduced from $O(120^3)$ to $O(m^3)$. For example,
let us consider the estimator using 6 data points in each bin (m = 20). The standard errors (not
shown in the plot) of the average errors are computed and two-sample t tests are conducted
to compare the binned estimator to the original estimator. For all four functions, the
p-values are larger than 0.1, which indicates no significant loss of accuracy from binning the data
into 20 bins in this experiment. In the meantime, the computational complexity is reduced to
$O(20^3)$, 0.4% of $O(120^3)$ on the full data.
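The size of the saving follows directly from the cubic cost. As a small worked example (with variable names chosen here for illustration), the ratio $(m/n)^3$ can be tabulated for every bin count used in the simulation; for m = 20 it equals 1/216, roughly 0.46%, which is the figure quoted as 0.4% in the text.

```python
n = 120
points_per_bin = (2, 3, 4, 5, 6, 8, 10)

# Map each number of bins m = n / p to the cost ratio m^3 / n^3.
cost_ratio = {n // p: (n // p) ** 3 / n ** 3 for p in points_per_bin}
```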

In a second experiment, the periodic Gaussian kernel is replaced by a Gaussian kernel, which
is the most common choice in practice. We repeat the same experiments and plot the average
mean squared errors in the right panel of Figure 2. The errors from using the Gaussian
kernel are generally higher than those from the periodic Gaussian kernel, since the Gaussian
kernel does not take into account the fact that our functions are periodic. However, the
binned estimators have almost the same accuracy as the unbinned ones when there are
enough bins (say 24) in this simulation. The computation reduction for the Gaussian
kernel is the same as in the periodic Gaussian case.

5.2  Cloud  Detection  over  Snow  and  Ice  Covered  Surface 

In this section, we test binning in a real classification problem using Gaussian kernel
regularization. By reducing the variance, binning the data is expected to preserve the classification
accuracy while relieving the computational burden, even in classification. Here we illustrate
the effect of binning using a polar cloud detection problem arising in atmospheric science.
In polar regions, detecting clouds using satellite remote sensing data is difficult, because the
surface is covered by snow and ice that have reflecting signatures similar to those of clouds. In Shi
et al. (2004), the Enhanced Linear Correlation Matching Classification (ELCMC) algorithm
based on three features was developed for polar cloud detection using data collected by the
Multi-angle Imaging SpectroRadiometer (MISR).

Figure 3: MISR image and expert labels

By thresholding the features, the ELCMC algorithm has an average accuracy of about 91%
(compared to expert labels) over 60 different scenes, with around 55,000 valid pixels in each
scene. However, there are some scenes that are very hard to classify using the simple
threshold method. The data set we investigate in this paper was collected in MISR orbit 18528,
blocks 22-24, over Greenland in 2002, with only a 75% accuracy rate by the ELCMC method.
The MISR red channel image of the data is shown in the left panel of Figure 3. It is not
easy to separate clouds from the surface because the scene itself is very complicated. There
are several types of clouds in this scene, such as low clouds, high clouds, and transparent high
clouds above low clouds. Moreover, this scene also contains different types of surfaces, such
as smooth snow-covered terrain, rough terrain, frozen rivers, and cliffs.

At present, the most reliable way to get a large volume of validation data for polar cloud
detection is expert labelling, since there are not enough ground measurements in the polar
regions. The expert labels from our collaborator Prof. Eugene Clothiaux (Department of
Meteorology, PSU) are shown in the right panel, with white pixels denoting "cloudy", gray pixels
"clear" and black pixels "not sure". There are 54879 pixels with "cloudy" or "clear" labels in
this scene and we use half of these labels for training and half for testing different classifiers.
Each pixel is associated with a three-dimensional vector X = (log(SD), CORR, NDAI),
computed from the original MISR data as described in Shi et al. (2004). Hence we build and
test classifiers based on these three features.

We test binning on the Gaussian kernel regularization with two different types of loss
functions. One is the $L_2$ loss function studied in this paper, and the other is the
hinge loss function corresponding to the Support Vector Machine. In both cases, we bin
the data based on the empirical marginal distributions of the three predictors. For each
predictor, we find the 10%, 20%, ..., 90% percentiles of the empirical distribution, and
these percentiles serve as the split points for that predictor. Therefore, we get 1000 bins in
the three-dimensional space. Of those bins, 966 contain data and 34 are empty. Thus,
the 966 bin centers are our binned data in the experiments. The computation is carried out
in Matlab 6 on a desktop computer with a Pentium 4 2.4GHz CPU and 512 MB of memory.
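The binning step just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the original Matlab code: the percentile convention in `decile_splits` is one simple choice among several, a bin center is taken to be the mean of the points falling in the bin (the text does not say how centers are computed), and the synthetic points merely stand in for the three features.

```python
import random
from collections import defaultdict

def decile_splits(values):
    """Nine split points: the 10%, ..., 90% empirical percentiles (one simple convention)."""
    s = sorted(values)
    n = len(s)
    return [s[(q * n) // 10] for q in range(1, 10)]

def bin_index(x, splits):
    """Index 0..9 of the interval that contains x along one coordinate."""
    return sum(1 for c in splits if x >= c)

def bin_data(points):
    """Group point indices by their cell in the product grid of marginal deciles."""
    dim = len(points[0])
    splits = [decile_splits([pt[j] for pt in points]) for j in range(dim)]
    groups = defaultdict(list)
    for i, pt in enumerate(points):
        key = tuple(bin_index(pt[j], splits[j]) for j in range(dim))
        groups[key].append(i)
    return dict(groups)

def bin_centers(points, groups):
    """Assumption: a bin center is the mean of the points that fall in the bin."""
    dim = len(points[0])
    return {key: tuple(sum(points[i][j] for i in idx) / len(idx) for j in range(dim))
            for key, idx in groups.items()}

rng = random.Random(0)
# Synthetic, correlated 3-d points standing in for (log(SD), CORR, NDAI).
points = []
for _ in range(5000):
    z = rng.gauss(0, 1)
    points.append((z + rng.gauss(0, 0.5), z + rng.gauss(0, 0.5), rng.gauss(0, 1)))

groups = bin_data(points)              # only non-empty bins appear as keys
centers = bin_centers(points, groups)
```

Because the grid is built from marginal percentiles, correlated predictors leave some of the 1000 cells empty, which is why the number of non-empty bins (966 in the paper) is smaller than 1000.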

5.2.1  Binning  on  Gaussian  Kernel  Regularization  with  L2  loss 

The Gaussian kernel regularization with the $L_2$ loss function is tested with three different
setups for the training data. The first setup randomly samples a small proportion of the data
as training data and trains the classifier on them. This is the common approach to dealing
with large data sets and it serves as a baseline for our comparison. In the second setup, the
bin centers and the majority vote of the labels in each bin are used as training data and
responses. Thus, each bin center is treated as one data point in this case. In the last setup,
the training data and labels are the bin centers and the proportion of 1's in each bin. To
reflect the fact that different bins may contain different numbers of data points, we also give a
weight to each bin center in the loss function in the last setup. In all three setups, half of
the 54879 data points are left out for choosing the best parameters $\omega$ and $\lambda$.
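The responses used in the second and third setups can be computed from the bin memberships in one pass. The sketch below is illustrative: `groups` is a hypothetical mapping from a bin identifier to the indices of the points in that bin, and breaking ties in the majority vote toward 0 is an assumption, since the text does not specify tie handling.

```python
def bin_labels(labels, groups):
    """For each bin return (majority vote, proportion of 1's, number of points)."""
    out = {}
    for key, members in groups.items():
        count = len(members)
        proportion = sum(labels[i] for i in members) / count
        majority = 1 if proportion > 0.5 else 0  # ties go to 0 (assumption)
        out[key] = (majority, proportion, count)
    return out

# Toy example: five labelled points falling into two bins.
toy_labels = [1, 1, 0, 0, 0]
toy_groups = {"a": [0, 1, 2], "b": [3, 4]}
toy_result = bin_labels(toy_labels, toy_groups)
```

The majority vote is the response in the second setup; the proportion of 1's is the fuzzy label and the count gives the weight of the bin in the third setup.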

In the first setup, we randomly sample 966 data points from the full data (54879 data
points) and use the corresponding labels y (0 and 1) to train the classifier $\hat{y} = K_\omega(K_\omega + \lambda I)^{-1} y$.
The predicted labels are set to be the indicator function $I(\hat{y} > 0.5)$. Cross-validation is
performed to choose the parameters $(\omega, \lambda)$ from $\omega = 0.8 + (i - 5) \times 0.05$ and
$\lambda = 0.1 + (j - 5) \times 0.005$ for $i, j = 1, \ldots, 9$. For each $(\omega, \lambda)$ pair, this procedure is repeated
21 times and the average classification rate is reported. The best average classification rate is
71.40% (with SE 0.43%). With the classification results from the 21 runs, we also take the majority
vote over the results to build a "bagged" classifier, which improves the accuracy to 77.77%. As
discussed in Breiman (1996) and Bühlmann and Yu (2002), bagging reduces the classification
error by reducing the variance.

                        Random sample of size 966            966 bin centers
                        GKR-L2         Bagged GKR-L2         GKR-L2               GKR-L2 with fuzzy labels
Accuracy                71.40% *       77.77%                75.86%               79.22%
Comp. time (seconds)    81 x 26.24     81 x 21 x 26.24       3.87 + 81 x 26.24    3.87 + 81 x 26.24
                        = 35.42 min    = 12.40 hours         = 35.48 min          = 35.48 min

Table 1: Binning $L_2$ Gaussian kernel regularization for cloud detection. Note: * The average
accuracy of 21 runs.

In the second setup, the 966 bin centers are used as training data. Cross-validation is
carried out to find the best parameters $(\omega, \lambda)$ over the same range as in the first setup. The
classifier is then applied to the full data to get an accuracy rate. The best set of parameters
leads to an accuracy rate of 75.86%.

In the third setup, we solve the following minimization problem:
$\min_c \, (y - Kc)^T W (y - Kc) + \lambda c^T K c$, with the weights in the diagonal matrix W being
proportional to the number of data points in each bin. This leads to the solution
$c = (K + \lambda W^{-1})^{-1} y$. Doing cross-validation over the same range of parameters, we achieve
an accuracy of 79.22%, which is the best result with sample size 966 under the $L_2$ loss.
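For completeness, the closed form of the weighted solution can be checked by setting the gradient of the objective to zero. The short derivation below is a sketch under the assumptions that K is symmetric and invertible and that W is diagonal with strictly positive weights, which holds when only non-empty bins are used.

```latex
\begin{align*}
\nabla_c \left[ (y - Kc)^T W (y - Kc) + \lambda c^T K c \right]
  &= -2 K W (y - Kc) + 2 \lambda K c = 0 \\
\Longrightarrow \quad W (y - Kc) &= \lambda c \\
\Longrightarrow \quad (W K + \lambda I)\, c &= W y \\
\Longrightarrow \quad c &= (K + \lambda W^{-1})^{-1} y .
\end{align*}
```

The last step multiplies both sides by $W^{-1}$. With W equal to the identity the expression reduces to the unweighted solution $(K + \lambda I)^{-1} y$ used in the first two setups.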

We compare the computation time (in Matlab) of those setups in Table 1 as well. Training
and testing the $L_2$ Gaussian kernel regularization on 966 data points takes 26.24 seconds on
average. Using cross-validation to choose the best parameters, it takes about 35.42 minutes
(26.24 seconds x the number of parameter pairs tested) for the simple classifier in the first setup,
and the "bagged" classifier takes 12.40 hours. In the second and third setups, binning
the data into 966 bins takes 3.87 seconds and the training process takes 35.42 minutes. So
the computation of the binned classifiers takes only 4.77% (35.48 min / 12.40 hr) of the time for
training the "bagged" classifier, yet it provides better classification results. It is worthwhile
to point out that the training step of these classifiers involves inverting an n by n matrix, and
the computer runs out of memory when the training data size is larger than 3000.
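The timing figures quoted for the $L_2$ experiments follow from three measured quantities: 26.24 seconds per fit, 3.87 seconds for binning, and the 9 x 9 grid of parameter pairs. As a small arithmetic check (variable names are chosen here for illustration):

```python
fit_seconds = 26.24       # one train-and-test run on 966 points
binning_seconds = 3.87    # binning the data into 966 bins
n_pairs = 9 * 9           # parameter pairs in the cross-validation grid
n_bags = 21               # repetitions used for the bagged classifier

single_minutes = n_pairs * fit_seconds / 60
bagged_hours = n_bags * n_pairs * fit_seconds / 3600
binned_minutes = (binning_seconds + n_pairs * fit_seconds) / 60
binned_share_of_bagged = binned_minutes / (bagged_hours * 60)
```

These reproduce the 35.42 minutes, 12.40 hours, roughly 35.5 minutes, and 4.77% reported in the text.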

5.2.2  Binning  on  Gaussian  Kernel  SVM 

The Gaussian kernel Support Vector Machine is a regularization method using the hinge loss
function in expression (2). Because of the hinge loss function, a large proportion of the
parameters $c_1, \ldots, c_n$ are zero, and the data points with non-zero parameters are called support
vectors (see Vapnik 1995 and Wahba et al. 1999 for details). In this section, we study the effect of
binning on the Gaussian kernel SVM for the polar cloud detection problem, even though our
theoretical results only cover the $L_2$ loss.

The software that we used to train the SVM is the Ohio State University SVM Classifier
Matlab Toolbox (Junshui Ma et al., http://www.eleceng.ohio-state.edu/~maj/osu_svm/).
The OSU SVM toolbox implements SVM classifiers in C++ using the LIBSVM algorithm of
Chih-Chung Chang and Chih-Jen Lin (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
The LIBSVM algorithm breaks the large SVM Quadratic Programming (QP) optimization
problem into a series of small QP problems to allow the training data size to be very large.
The computational complexity of training LIBSVM is empirically around $O(n_1^2)$, where $n_1$
is the training sample size. The complexity of testing is $O(s n_2)$, where $n_2$ is the test size
and s is the number of support vectors, which usually increases linearly with the size of the
training data set.

Similar to the $L_2$ Gaussian kernel regularization in section 5.2.1, the Gaussian kernel
SVM is tested on three different types of training data. The first two setups are identical
to the ones used in the $L_2$ loss case. However, the third setup in the $L_2$ case is not easy
to carry out in OSU SVM, since the OSU SVM training package does not accept fuzzy
labels or support adding weights to individual points. Hence we replace the third setup
by randomly sampling half of the data (27179 points) and compare the accuracy of the SVM
trained on this large sample to the ones from the first two types of training data. For all
tested classifiers, the accuracy, computation time and number of support vectors are given
in Table 2.

                        Random sample of size 966         SVM on 966            SVM,
                        SVM           Bagged SVM          bin centers           size 27179
Accuracy                85.09% *      86.07%              86.08%                86.46%
Comp. time (seconds)    81 x 1.85     21 x 81 x 1.85      3.87 + 81 x 1.85      81 x 266.06
                        = 2.5 min     = 52.11 min         = 2.56 min            = 5.99 hours
# Support vectors       350           ~ 7350              210                   8630

Table 2: Binning SVM for cloud detection. Note: * average rate of 21 runs with SE 0.18%

The first observation from the table is that the SVM trained on all the training data (27179
points) provides the best test classification rate, but requires the longest computation time. The
accuracy rates of the bagged SVM and the SVM on bin centers are comparable, but the
bagged SVM needs about 20 times more computation time. The time for training the SVM on 966 bin
centers is 2.56 minutes, only 0.71% of the 5.99 hours used to train the SVM on 27179 samples.
With the same amount of computation, the accuracy of the SVM on bin centers (86.08%) is
significantly higher (5 SE above the average) than the average accuracy (85.09%) of the
same-sized SVM on random samples. Therefore, the SVM on bin centers is better than the SVM
on data of the same size randomly sampled from the full data. Thus, the SVM on the bin centers
is computationally the most efficient method for training the SVM, and it provides almost the
same accuracy as the full-size SVM.

Besides the training time, the number of support vectors determines the computation
time needed to classify new data. As shown in the table, the SVM on bin centers has the fewest
support vectors, so it is the fastest to classify a new point. From the comparison,
it is clear that the SVM on the binned data provides almost the best accuracy, fast training,
and fast prediction.
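The comparisons drawn from Table 2 reduce to a few ratios of the reported numbers. The following arithmetic check uses only values stated in the table and text; the variable names are chosen here for illustration.

```python
n_pairs = 81                                  # parameter pairs in the grid
binned_seconds = 3.87 + n_pairs * 1.85        # binning + SVM on 966 bin centers
full_seconds = n_pairs * 266.06               # SVM on 27179 samples

binned_minutes = binned_seconds / 60
full_hours = full_seconds / 3600
time_share = binned_seconds / full_seconds    # binned time as a share of full time

# Accuracy gap between bin centers and random samples, in standard errors.
se_gap = (86.08 - 85.09) / 0.18

# Support vectors: random sample of 966 versus 966 bin centers.
sv_ratio = 350 / 210
```

These give about 2.56 minutes against 5.99 hours (a share of roughly 0.71%), a gap of 5.5 standard errors, and about 67% more support vectors for the randomly sampled SVM.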

Finally, we compare binning with another possible sample-size reduction scheme,
clustering. Feng and Mangasarian (2001) proposed using the k-means clustering algorithm
to pick a small proportion of the training data for the SVM. This method first clusters the data into
m clusters. Although this method reduces the size of the training data as well, the computation
of k-means clustering itself is very expensive compared to that of training the SVM, or even
infeasible due to memory usage when the data size is too large. In the cloud detection
problem, clustering 27179 training data points into 512 groups takes 21.65 minutes, and the time
increases dramatically when the number of centroids increases. The increase in the
computer memory requirement is even worse than the increase in the computation time. The
computer runs out of memory when we try to cluster the data into 966 clusters.

For comparison, clustering-SVM and binning-SVM on 512 groups provide very close
classification rates, 85.72% and 85.64% respectively, but binning is much faster than
clustering. Running in Matlab, the clustering itself takes 21.65 minutes, which is 376 times the
computation time (3.45 seconds) of binning the data into 512 bins. The numbers of support vectors
of clustering-SVM and binning-SVM are very close, 145 and 143 respectively, so their testing
times are about the same. Thus binning is preferable to clustering for reducing the computation
of the SVM.


6 Summary

To reduce the computational burden of the Gaussian kernel regularization methods, we
propose binning the training data. The effect of binning on the periodic Gaussian kernel
regularization method is studied in nonparametric regression. While reducing the computational
complexity from $O(n^3)$ to the order of $O(n)$ or less, the binned estimator keeps the
asymptotic minimax rates of the periodic Gaussian regularization in the Sobolev spaces.

Simulations in the finite sample regression case suggest that the performance of the
binned periodic Gaussian kernel regularization estimator is comparable to that of the original
estimator in terms of the estimation error. In our simulation of binning 120 data points into
20 bins, computing the binned estimator takes only 0.4% of the computation time of the
unbinned estimator, but the binned estimator provides almost the same accuracy. Binning
the Gaussian kernel regularization also gives error rates close to those using the full data in
our simulation.

In the polar cloud detection problem, we tested the binning method on the $L_2$ loss
Gaussian kernel regularization and the Gaussian kernel SVM. With the same computation time,
the $L_2$-loss Gaussian kernel regularization on 966 bins achieves better accuracy (79.22%)
than that (71.40%) on 966 randomly sampled data points. For the SVM, binning reduces the
computation time (from 5.99 hours to 2.56 minutes in our example), keeps the classification
accuracy, and speeds up the testing step by providing simpler classifiers with fewer support
vectors (from 8630 to 210). The SVM trained on 966 randomly selected samples has a training time
similar to, and a test classification rate slightly worse than, the SVM on 966 bins, but
has 67% more support vectors and so takes 67% longer to predict on a new data point. The
SVM trained on 512 cluster centers from the k-means algorithm reports almost the same test
classification rate and a similar number of support vectors as the SVM on 512 bins, but the
k-means clustering itself takes about 376 times more computation time than binning.

In summary, binning can be used as an effective method for dealing with a large number
of training data points in Gaussian kernel regularization methods. Binning is also preferable
to clustering for reducing the computation of the SVM.

Appendix


Proof  of  Theorem  1: 

As shown in section 3, the expansion of the kernel K(., .) in equation (6) leads to

$$G^{(n)}_{i,j} = K(x_i, x_j) = 2\sum_{l=0}^{\infty} e^{-l^2\omega^2/2}\,[\sin(2\pi l x_i)\sin(2\pi l x_j) + \cos(2\pi l x_i)\cos(2\pi l x_j)] \quad (19)$$

with $x_i = \frac{i}{n} - \frac{1}{2n}$.

In case n is an odd number (n = 2q + 1), any non-negative integer l can be written as
$l = kn - h$ or $l = kn + h$, where both k and h are integers satisfying $k \ge 0$ and $0 \le h \le q$.
For any $k \ge 1$, $h > 0$, and all i,

$$\sin(2\pi(kn + h)x_i) = \sin(2\pi k n x_i + 2\pi h x_i)$$
$$= \sin(2\pi k n x_i)\cos(2\pi h x_i) + \cos(2\pi k n x_i)\sin(2\pi h x_i)$$
$$= \sin(2ki\pi - k\pi)\cos(2\pi h x_i) + \cos(2ki\pi - k\pi)\sin(2\pi h x_i)$$
$$= (-1)^k \sin(2\pi h x_i).$$

In the same way, we get $\sin(2\pi(kn - h)x_i) = (-1)^{k+1}\sin(2\pi h x_i)$, and $\cos(2\pi(kn + h)x_i) =
\cos(2\pi(kn - h)x_i) = (-1)^k\cos(2\pi h x_i)$. In case h = 0, we have $\sin(2\pi k n x_i) = 0$ and
$\cos(2\pi k n x_i) = (-1)^k$ for all i. Therefore, the Gram matrix $G^{(n)}$ can be written as:
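The aliasing identities used in this step are easy to confirm numerically. The check below is an illustration added for the reader: it evaluates both sides of the four identities on the grid $x_i = i/n - 1/(2n)$ and returns the largest discrepancy. The function name and the range of k are assumptions made for the example.

```python
import math

def max_identity_error(n, k_max=3):
    """Largest violation of the four aliasing identities on the grid x_i = i/n - 1/(2n)."""
    q = (n - 1) // 2
    xs = [i / n - 1 / (2 * n) for i in range(1, n + 1)]
    two_pi = 2 * math.pi
    err = 0.0
    for k in range(1, k_max + 1):
        sign = (-1) ** k
        for h in range(1, q + 1):
            for x in xs:
                s, c = math.sin(two_pi * h * x), math.cos(two_pi * h * x)
                err = max(
                    err,
                    abs(math.sin(two_pi * (k * n + h) * x) - sign * s),  # (-1)^k sin
                    abs(math.sin(two_pi * (k * n - h) * x) + sign * s),  # (-1)^(k+1) sin
                    abs(math.cos(two_pi * (k * n + h) * x) - sign * c),  # (-1)^k cos
                    abs(math.cos(two_pi * (k * n - h) * x) - sign * c),  # (-1)^k cos
                )
    return err

error_odd_n = max_identity_error(7)
```

The discrepancy is at the level of floating-point rounding, as expected from $2\pi k n x_i = 2ki\pi - k\pi$.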

$$G^{(n)}_{i,j} = d_0 + \sum_{h=1}^{q} [d_h^s \sin(2\pi h x_i)\sin(2\pi h x_j) + d_h^c \cos(2\pi h x_i)\cos(2\pi h x_j)] \quad (20)$$

where

$$d_0 = 2\sum_{k=0}^{\infty} (-1)^k e^{-(kn)^2\omega^2/2},$$
$$d_h^s = 2\Big\{e^{-h^2\omega^2/2} + \sum_{k=1}^{\infty} (-1)^k \big(e^{-(kn+h)^2\omega^2/2} - e^{-(kn-h)^2\omega^2/2}\big)\Big\},$$
$$d_h^c = 2\Big\{e^{-h^2\omega^2/2} + \sum_{k=1}^{\infty} (-1)^k \big(e^{-(kn+h)^2\omega^2/2} + e^{-(kn-h)^2\omega^2/2}\big)\Big\}.$$

Let $V_0^{(n)} = \sqrt{1/n}\,(1, \ldots, 1)^T$, $V_{2h-1}^{(n)} = \sqrt{2/n}\,(\sin(2\pi h x_1), \ldots, \sin(2\pi h x_n))^T$, and
$V_{2h}^{(n)} = \sqrt{2/n}\,(\cos(2\pi h x_1), \ldots, \cos(2\pi h x_n))^T$, for $h = 1, \ldots, q$. Using the following orthogonal
relationships

$$\sum_{i=1}^{n} \sin(2\pi\mu x_i)\sin(2\pi\nu x_i) = n/2 \text{ for } \mu = \nu = 1, \ldots, q; \quad = 0 \text{ for } \mu \ne \nu;\ \mu, \nu = 0, \ldots, q$$
$$\sum_{i=1}^{n} \cos(2\pi\mu x_i)\cos(2\pi\nu x_i) = n/2 \text{ for } \mu = \nu = 1, \ldots, q; \quad = 0 \text{ for } \mu \ne \nu;\ \mu, \nu = 0, \ldots, q$$
$$\sum_{i=1}^{n} \cos(2\pi\mu x_i) = 0 \text{ for } \mu = 1, \ldots, q$$
$$\sum_{i=1}^{n} \sin(2\pi\mu x_i) = 0 \text{ for } \mu = 1, \ldots, q$$

we can easily see that $V_0^{(n)}, \ldots, V_{2q}^{(n)}$ are orthonormal vectors. Furthermore, they are the
eigenvectors of $G^{(n)}$ with corresponding eigenvalues $d_0^{(n)} = n d_0$, $d_{2h-1}^{(n)} = n d_h^s/2$, and $d_{2h}^{(n)} = n d_h^c/2$,
since $G^{(n)} = \sum_l d_l^{(n)} V_l^{(n)} V_l^{(n)T}$. This completes the proof for odd n = 2q + 1.

For an even number n = 2q of observations, the eigenvectors of $G^{(n)}$ are $V_0^{(n)}, \ldots, V_{n-1}^{(n)}$ and the
eigenvalues are $d_0^{(n)}, \ldots, d_{n-1}^{(n)}$, while both are the same as defined in the odd number case.
All the proofs for odd n hold here except that $\sin(2\pi k q x_i) = \sin(2\pi k q (i/n - 1/(2n))) = 0$
for all $k \ge 0$, which leaves $V_0, \ldots, V_{2q-2}, V_{2q}$ as the 2q eigenvectors. The eigenvectors are
slightly different from those for odd n, but this difference does not affect the asymptotic
results at all. Therefore, we will use the eigen-structure for an odd number of observations in the
rest of the paper.


To simplify the notation, we can write the eigenvalues $d_l^{(n)}$ in terms of $\rho_l$:

$$d_0^{(n)} = 2n\sum_{k=0}^{\infty} (-1)^k \rho_{2kn},$$
$$d_l^{(n)} = n\Big\{\rho_l + \sum_{k=1}^{\infty} (-1)^k \big[\rho_{2(kn+h)} + (-1)^l \rho_{2(kn-h)}\big]\Big\},$$

where $l = 1, \ldots, n-1$, and $h = [(l+1)/2]$, while [a] stands for the integer part of a. □

Proof of Proposition 1:

For $x \in (0, 1]$, $x_j$ as defined in Proposition 1, and $k \ge 0$,

$$\sum_{j=1}^{m} K(x, x_j)\cos(2\pi k x_j)$$
$$= \sum_{j=1}^{m} \Big\{2\sum_{l=0}^{\infty} \exp(-l^2\omega^2/2)\cos(2\pi l(x - x_j))\cos(2\pi k x_j)\Big\}$$
$$= 2\sum_{l=0}^{\infty} \exp(-l^2\omega^2/2)\Big\{\sum_{j=1}^{m} \cos(2\pi l(x - x_j))\cos(2\pi k x_j)\Big\}$$
$$= \sum_{l=0}^{\infty} \exp(-l^2\omega^2/2)\Big\{\sum_{j=1}^{m} \cos(2\pi(lx + (k - l)x_j)) + \cos(2\pi(lx - (k + l)x_j))\Big\}.$$

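For any integer r,

$$\sum_{j=1}^{m} \cos(2\pi(lx + r x_j)) = \sum_{j=1}^{m} \cos\Big(2\pi l x - \frac{r}{m}\pi + 2\pi\frac{r}{m}j\Big)
= \begin{cases} 0 & \text{when } r/m \text{ is not an integer;} \\ m(-1)^{r/m}\cos(2\pi l x) & \text{when } r/m \text{ is an integer.} \end{cases}$$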
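The sum identity above can be confirmed numerically. The sketch below is an illustration, assuming the bin centers $x_j = j/m - 1/(2m)$, which is the form implied by the middle expression of the identity; the function names and test values are chosen for the example.

```python
import math

def cosine_sum(l, r, x, m):
    """Left-hand side: sum over j of cos(2*pi*(l*x + r*x_j)) with x_j = j/m - 1/(2m)."""
    return sum(math.cos(2 * math.pi * (l * x + r * (j / m - 1 / (2 * m))))
               for j in range(1, m + 1))

def closed_form(l, r, x, m):
    """Right-hand side: 0 unless m divides r, else m * (-1)^(r/m) * cos(2*pi*l*x)."""
    if r % m != 0:
        return 0.0
    return m * (-1) ** (r // m) * math.cos(2 * math.pi * l * x)

m, x = 5, 0.3  # illustrative values
max_gap = max(abs(cosine_sum(l, r, x, m) - closed_form(l, r, x, m))
              for l in range(4) for r in range(-10, 11))
```

The two sides agree up to rounding for every tested pair (l, r), including negative r and r = 0.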


Therefore,

$$\sum_{j=1}^{m} K(x, x_j)\cos(2\pi k x_j) = d_k^{(m)}\cos(2\pi k x),$$

while $d_k^{(m)}$ follows the definition in Proposition ??. It is also true that
$\sum_{j=1}^{m} K(x, x_j)\sin(2\pi k x_j) = d_k^{(m)}\sin(2\pi k x)$. As shown in Proposition ??, the eigenvector $V_k^{(m)}$ of
$G^{(m)}$ is $\sqrt{2/m}\cos(2\pi k x_j)$ or $\sqrt{2/m}\sin(2\pi k x_j)$. Therefore,

$$G^{(n,m)} V_k^{(m)} = \sqrt{n/m}\; d_k^{(m)} V_k^{(n)}$$

for all $k = 0, 1, \ldots, m$.

□


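Proof of Theorem 2:

Following the relationship shown in Proposition 1, $G^{(n,m)}V^{(m)} = \sqrt{n/m}\,V^{(n,m)}\,\mathrm{diag}(d_l^{(m)})$, with
$V^{(n,m)}$ as the n by m matrix formed by the first m eigenvectors of $G^{(n)}$. The projection
matrix is

$$S_B = G^{(n,m)}V^{(m)}\,\mathrm{diag}\Big(\frac{1}{d_l^{(m)} + \lambda_B}\Big)V^{(m)T}B^{(m,n)} = \sqrt{n/m}\;V^{(n,m)}\,\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)V^{(m)T}B^{(m,n)}.$$

Since $B^{(m,n)}B^{(m,n)T} = \mathrm{diag}(m/n)$, the asymptotic variance of the estimator is:

$$\frac{1}{n}\sum_i \mathrm{var}(\hat{y}_i) = \frac{1}{n}\,\mathrm{trace}(S_B^T S_B)$$
$$= \frac{1}{n}\,\mathrm{trace}\Big(\frac{n}{m}V^{(n,m)}\,\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)V^{(m)T}B^{(m,n)}B^{(m,n)T}V^{(m)}\,\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)V^{(n,m)T}\Big)$$
$$= \frac{1}{n}\,\mathrm{trace}\Big(\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)^2\Big) = \frac{1}{n}\sum_{l=0}^{m-1}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)^2.$$

As proved before, $\lim_{m\to\infty} d_l^{(m)}/m = \rho_l$ for $l \ge 0$ and $\rho_l = 1/\beta_l$, so we get

$$\frac{1}{n}\sum_i \mathrm{var}(\hat{y}_i) \sim \frac{1}{n}\sum_{l=0}^{m-1}\Big(\frac{\rho_l}{\rho_l + \lambda_B/m}\Big)^2 = \frac{1}{n}\sum_{l=0}^{m-1}\Big(1 + \beta_l\frac{\lambda_B}{m}\Big)^{-2}.$$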


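Proof of Theorem 3:

The bias of the binned estimator is $\frac{1}{n}\sum_i \mathrm{Bias}^2(\hat{y}_i) = \frac{1}{n}((S_B - I)F)^T((S_B - I)F)$. Let $C^{(n,m)}$
denote an n by m matrix $(I_{m\times m} : 0_{m\times(n-m)})^T$. The term $(S_B - I)F$ is expanded as:

$$(S_B - I)F = \big(G^{(n,m)}(G^{(m)} + \lambda_B I)^{-1}B^{(m,n)} - I\big)V^{(n)}\Theta^{(n)}$$
$$= \sqrt{n/m}\;V^{(n,m)}\,\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)V^{(m)T}B^{(m,n)}V^{(n)}\Theta^{(n)} - V^{(n)}\Theta^{(n)}$$
$$= V^{(n)}\Big(\sqrt{n/m}\;C^{(n,m)}\,\mathrm{diag}\Big(\frac{d_l^{(m)}}{d_l^{(m)} + \lambda_B}\Big)V^{(m)T}B^{(m,n)}V^{(n)} - I^{(n)}\Big)\Theta^{(n)}.$$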


Now, let us study $V^{(m)T}B^{(m,n)}V^{(n)}$. We first start with one of the $V^{(n)}$'s eigenvectors, $\sqrt{2/n}\,(\cos 2\pi k x_1, \dots, \cos 2\pi k x_n)^T$:

$$
\begin{aligned}
B^{(m,n)}(\cos 2\pi k x_1, \dots, \cos 2\pi k x_n)^T
&= \big((\cos 2\pi k x_1 + \dots + \cos 2\pi k x_p)/p,\ \dots,\ (\cos 2\pi k x_{n-p+1} + \dots + \cos 2\pi k x_n)/p\big)^T \\
&= w_k^{(m,n)}(\cos 2\pi k \bar x_1, \dots, \cos 2\pi k \bar x_m)^T,
\end{aligned}
$$

where $\bar x_r$ is the center of a bin and $w_k^{(m,n)}$ is a constant depending only on $n$, $m$, and $k$. When $p = n/m$ is an odd number, $(x_{rp+1}, \dots, x_{rp+p})$ is expressed as $(\bar x_r - (p-1)/2n, \dots, \bar x_r, \dots, \bar x_r + (p-1)/2n)$. Thus,

$$
\cos 2\pi k x_{rp+1} + \dots + \cos 2\pi k x_{rp+p}
= \Big[1 + 2\cos\frac{2\pi k}{n} + \dots + 2\cos\frac{2\pi k\,((p-1)/2)}{n}\Big]\cos 2\pi k \bar x_r .
$$

Therefore, $w_k^{(m,n)} = \big(1 + \sum_{l=1}^{(p-1)/2} 2\cos\frac{2\pi k l}{n}\big)/p$ for odd $p$. It is straightforward to show that $w_k^{(m,n)} = \big(\sum_{l=1}^{p/2} 2\cos\frac{\pi k (2l-1)}{n}\big)/p$ for even $p$. In the same way, we have

$$
B^{(m,n)}(\sin 2\pi k x_1, \dots, \sin 2\pi k x_n)^T = w_k^{(m,n)}(\sin 2\pi k \bar x_1, \dots, \sin 2\pi k \bar x_m)^T .
$$
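The statement that binning maps a sampled cosine or sine to a rescaled cosine or sine at the bin centers can be verified directly. This is a small sketch under stated assumptions: an equispaced grid $x_i = (i - 1/2)/n$ (the grid offset is an assumption made here; the identity only needs spacing $1/n$), bins of $p = n/m$ consecutive points, and function names that are purely illustrative.

```python
import numpy as np

def bin_weight(k: int, n: int, m: int) -> float:
    """Constant w_k^(m,n) for odd and even bin size p = n/m."""
    p = n // m
    if p % 2 == 1:
        l = np.arange(1, (p - 1) // 2 + 1)
        return float((1.0 + np.sum(2.0 * np.cos(2.0 * np.pi * k * l / n))) / p)
    l = np.arange(1, p // 2 + 1)
    return float(np.sum(2.0 * np.cos(np.pi * k * (2 * l - 1) / n)) / p)

def binning_identity_holds(k: int, n: int, m: int) -> bool:
    """Check B cos(2 pi k x) = w_k cos(2 pi k x_bar), and the same for sin."""
    p = n // m
    x = (np.arange(1, n + 1) - 0.5) / n                   # assumed design points
    B = np.kron(np.eye(m), np.full((1, p), 1.0 / p))      # averages p neighbours
    x_bar = B @ x                                         # bin centers
    w = bin_weight(k, n, m)
    ok_cos = np.allclose(B @ np.cos(2 * np.pi * k * x), w * np.cos(2 * np.pi * k * x_bar))
    ok_sin = np.allclose(B @ np.sin(2 * np.pi * k * x), w * np.sin(2 * np.pi * k * x_bar))
    return bool(ok_cos and ok_sin)

odd_case = binning_identity_holds(k=2, n=15, m=5)      # p = 3
even_case = binning_identity_holds(k=3, n=120, m=20)   # p = 6
```

For $p = 1$ the weight is exactly 1, and for fixed $k$ it approaches 1 as the bin width $p/n = 1/m$ shrinks.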

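Let $j^\circ = \lfloor (j+1)/2 \rfloor$ for $0 \le j \le n-1$. Following the proof of Proposition 1 and assuming $m = 2q+1$ is an odd number, we can write any $j^\circ$ as $j^\circ = hm - i^\circ$ or $j^\circ = hm + i^\circ$ with $0 \le i^\circ \le q$, where $i^\circ$ is a function of $j$ and $m$. For the situation of odd $j$ and $j^\circ = hm + i^\circ$, we have

$$
\begin{aligned}
B^{(m,n)}V_j^{(n)}
&= B^{(m,n)}\sqrt{2/n}\,(\sin 2\pi j^\circ x_1, \dots, \sin 2\pi j^\circ x_n)^T \\
&= w_{j^\circ}^{(m,n)}\sqrt{2/n}\,(\sin 2\pi j^\circ \bar x_1, \dots, \sin 2\pi j^\circ \bar x_m)^T \\
&= w_{j^\circ}^{(m,n)}\sqrt{2/n}\,(-1)^h(\sin 2\pi i^\circ \bar x_1, \dots, \sin 2\pi i^\circ \bar x_m)^T .
\end{aligned}
$$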
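The sign factor $(-1)^h$ comes from aliasing of high frequencies on the $m$ bin centers. The sketch below checks the four cases (sine or cosine, $hm + i^\circ$ or $hm - i^\circ$) numerically. It assumes bin centers $\bar x_r = (r - 1/2)/m$, $r = 1, \dots, m$, which is the grid implied by design points $(i - 1/2)/n$; that grid is an assumption made for this illustration, and the function name is hypothetical.

```python
import numpy as np

def aliasing_signs_hold(m: int, h: int, i0: int) -> bool:
    """Check the folding of frequencies h*m +/- i0 onto frequency i0 at the bin centers."""
    x_bar = (np.arange(1, m + 1) - 0.5) / m      # assumed bin centers
    sign = (-1.0) ** h
    up, down = h * m + i0, h * m - i0
    base_sin = np.sin(2 * np.pi * i0 * x_bar)
    base_cos = np.cos(2 * np.pi * i0 * x_bar)
    checks = [
        np.allclose(np.sin(2 * np.pi * up * x_bar), sign * base_sin),     # (-1)^h
        np.allclose(np.sin(2 * np.pi * down * x_bar), -sign * base_sin),  # (-1)^(h+1)
        np.allclose(np.cos(2 * np.pi * up * x_bar), sign * base_cos),     # (-1)^h
        np.allclose(np.cos(2 * np.pi * down * x_bar), sign * base_cos),   # (-1)^h
    ]
    return all(bool(c) for c in checks)

all_cases_ok = all(
    aliasing_signs_hold(m=7, h=h, i0=i0) for h in range(0, 4) for i0 in range(0, 4)
)
```

Only the sine with $j^\circ = hm - i^\circ$ picks up the extra sign, which matches the two-case description of the constant $c_j^{m,n}$.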


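Similarly, we can derive the equation for even $j$ and $j^\circ = hm - i^\circ$. So the structure of $V^{(m)T}B^{(m,n)}V^{(n)}$ is:

$$
V_i^{(m)T}B^{(m,n)}V_j^{(n)} = \sqrt{\frac{m}{n}}\;w_{j^\circ}^{(m,n)}\,c_j^{m,n}\;I\big(i = 2i^\circ + ((-1)^j - 1)/2\big),
$$

where the constant $c_j^{m,n}$ equals $(-1)^h$ when (1) $j$ is even or (2) $j$ is odd and $j^\circ = hm + i^\circ$, and it equals $(-1)^{h+1}$ otherwise. Therefore, the matrix is nonzero only when $i = 2i^\circ + ((-1)^j - 1)/2$. Let us denote $\mu_{ij} = w_{j^\circ}^{(m,n)}c_j^{m,n}$ for the nonzero elements, so that $V^{(m)T}B^{(m,n)}V^{(n)} = \sqrt{m/n}\,(\mu_{ij})$, where the $m \times n$ matrix $(\mu_{ij})$ is in the following shape:

$$
\begin{pmatrix}
\mu_{0,0} & 0 & \cdots & 0 & 0 & 0 & 0 & \cdots & \mu_{0,2m-1} & 0 & \cdots \\
0 & \mu_{1,1} & \cdots & 0 & 0 & 0 & 0 & \cdots & 0 & \mu_{1,2m} & \cdots \\
\vdots & & \ddots & & & & & & & & \ddots \\
0 & 0 & \cdots & \mu_{m-2,m-2} & 0 & \mu_{m-2,m} & 0 & \cdots & 0 & 0 & \cdots \\
0 & 0 & \cdots & 0 & \mu_{m-1,m-1} & 0 & \mu_{m-1,m+1} & \cdots & 0 & 0 & \cdots
\end{pmatrix}
$$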
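The sparsity pattern described above can be inspected numerically. The following sketch is illustrative only and rests on several assumptions made here: design points $(i - 1/2)/n$, bins of $p = n/m$ consecutive points, odd $n$ and $m$, and a basis ordered as constant, then $\sin$ and $\cos$ pairs with frequency $\lfloor (j+1)/2 \rfloor$ for column $j$. The helper names are hypothetical.

```python
import numpy as np

def fourier_basis(k: int) -> np.ndarray:
    """k x k orthonormal trigonometric basis on the grid (i - 1/2)/k, for odd k."""
    x = (np.arange(1, k + 1) - 0.5) / k
    V = np.empty((k, k))
    V[:, 0] = 1.0 / np.sqrt(k)
    for j in range(1, k):
        freq = (j + 1) // 2                       # j° = floor((j + 1) / 2)
        trig = np.sin if j % 2 == 1 else np.cos   # odd j: sine, even j: cosine
        V[:, j] = np.sqrt(2.0 / k) * trig(2.0 * np.pi * freq * x)
    return V

def support_pattern(n: int, m: int, tol: float = 1e-10) -> np.ndarray:
    """Boolean m x n mask of the nonzero entries of V^(m)T B^(m,n) V^(n)."""
    p = n // m
    B = np.kron(np.eye(m), np.full((1, p), 1.0 / p))
    M = fourier_basis(m).T @ B @ fourier_basis(n)
    return np.abs(M) > tol

pattern = support_pattern(n=15, m=5)
at_most_one_per_column = bool((pattern.sum(axis=0) <= 1).all())
leading_block_is_diagonal = bool(np.array_equal(pattern[:, :5], np.eye(5, dtype=bool)))
```

In this small case the entries at positions $(m-2, m)$ and $(m-1, m+1)$ are nonzero, and a column can vanish entirely when its weight $w_{j^\circ}^{(m,n)}$ happens to be zero.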


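Since $A^{(n,n)} = C^{(n,m)}\,\mathrm{diag}\Big(\sqrt{\frac{n}{m}}\,\frac{d_l^{(m)}}{d_l^{(m)}+\lambda_B}\Big)V^{(m)T}B^{(m,n)}V^{(n)} - I^{(n)}$, the entry of $A^{(n,n)}$ is $a_{ij} = \frac{d_i^{(m)}}{d_i^{(m)}+\lambda_B}\,\mu_{ij} - I(j = i)$ for $0 \le i < m$, and $a_{ij} = -I(j = i)$ for all $i \ge m$. So $A^{(n,n)}$ is

$$
A^{(n,n)} =
\begin{pmatrix}
\mathrm{diag}\Big(\dfrac{d_i^{(m)}\mu_{ii}}{d_i^{(m)}+\lambda_B} - 1\Big) & \Big(\dfrac{d_i^{(m)}\mu_{ij}}{d_i^{(m)}+\lambda_B}\Big)_{0 \le i < m,\ m \le j < n} \\[2ex]
0 & -I^{(n-m)}
\end{pmatrix}.
$$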


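Now let us study the bias as a whole term.

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\mathrm{Bias}^2(\hat y_i)
&= \frac{1}{n}\big((S_B - I)F\big)^T\big((S_B - I)F\big) \\
&= \frac{1}{n}\big(V^{(n)}A^{(n,n)}\Theta^{(n)}\big)^T V^{(n)}A^{(n,n)}\Theta^{(n)} \\
&= \Big(\frac{\Theta^{(n)}}{\sqrt n}\Big)^T A^{(n,n)T}A^{(n,n)}\,\frac{\Theta^{(n)}}{\sqrt n} \\
&= \sum_{j=0}^{m-1}\Big(\frac{\Theta_j^{(n)}}{\sqrt n}\Big)^2\Big(\frac{d_j^{(m)}\mu_{jj}}{d_j^{(m)}+\lambda_B} - 1\Big)^2
 + \sum_{j=m}^{n-1}\Big(\frac{\Theta_j^{(n)}}{\sqrt n}\Big)^2 \\
&\quad + 2\sum_{k=0}^{m-1}\sum_{j=m}^{n-1}\frac{\Theta_k^{(n)}\Theta_j^{(n)}}{n}\Big(\frac{d_k^{(m)}\mu_{kk}}{d_k^{(m)}+\lambda_B} - 1\Big)\frac{d_k^{(m)}\mu_{kj}}{d_k^{(m)}+\lambda_B} \\
&\quad + \sum_{k=0}^{m-1}\sum_{i=m}^{n-1}\sum_{j=m}^{n-1}\frac{\Theta_i^{(n)}\Theta_j^{(n)}}{n}\Big(\frac{d_k^{(m)}}{d_k^{(m)}+\lambda_B}\Big)^2\mu_{ki}\,\mu_{kj} \\
&\to \sum_{j=0}^{m-1}\Big(\frac{\Theta_j^{(n)}}{\sqrt n}\Big)^2\Big(\frac{d_j^{(m)}\mu_{jj}}{d_j^{(m)}+\lambda_B} - 1\Big)^2
 + \sum_{j=m}^{n-1}\Big(\frac{\Theta_j^{(n)}}{\sqrt n}\Big)^2
\end{aligned}
$$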


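when $n \to \infty$, $m \to \infty$ and $d_m^{(m)}/(d_m^{(m)} + \lambda_B) \to 0$. Since $c_j^{m,n} = 1$ for $j = 1, \dots, m$ and $w_{j^\circ}^{(m,n)} \to 1$ as $m/n \to 0$, we have $\mu_{jj} \to 1$. Therefore,

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\mathrm{Bias}^2(\hat y_i)
&\sim \sum_{j=0}^{m-1}\theta_j^2\Big(\frac{\rho_j}{\rho_j + \lambda_B/m} - 1\Big)^2 + \sum_{j=m}^{\infty}\theta_j^2 \\
&= \sum_{j=0}^{m-1}\theta_j^2\Big(\frac{\lambda_B/m}{\rho_j + \lambda_B/m}\Big)^2 + \sum_{j=m}^{\infty}\theta_j^2 \\
&= \sum_{j=0}^{m-1}\theta_j^2\Big(\frac{\beta_j\lambda_B/m}{1 + \beta_j\lambda_B/m}\Big)^2 + \sum_{j=m}^{\infty}\theta_j^2 .
\end{aligned}
$$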
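To see how the two parts of the asymptotic bias behave, the limiting expression can be evaluated for given coefficient sequences. This is a hedged sketch: it assumes the limiting form $\sum_{j<m}\theta_j^2\big(\tfrac{\beta_j\lambda_B/m}{1+\beta_j\lambda_B/m}\big)^2 + \sum_{j\ge m}\theta_j^2$, truncates the infinite tail at the length of the input arrays, and uses hypothetical sequences `theta` and `beta` chosen only for illustration (they are not taken from the paper).

```python
import numpy as np

def asymptotic_bias_sq(theta, beta, lam_B: float, m: int) -> float:
    """Shrinkage bias on the first m coefficients plus the truncated tail energy."""
    theta = np.asarray(theta, dtype=float)
    beta = np.asarray(beta, dtype=float)
    t = beta[:m] * lam_B / m
    head = np.sum(theta[:m] ** 2 * (t / (1.0 + t)) ** 2)   # bias from regularization
    tail = np.sum(theta[m:] ** 2)                          # bias from dropping j >= m
    return float(head + tail)

# Hypothetical inputs: polynomially decaying theta_j, rapidly growing beta_j.
j = np.arange(200)
theta = 1.0 / (1.0 + j) ** 2
beta = np.exp(0.05 * j ** 2)

tail_only = asymptotic_bias_sq(theta, beta, lam_B=0.0, m=20)
with_penalty = asymptotic_bias_sq(theta, beta, lam_B=1e-3, m=20)
```

With `lam_B = 0` only the tail term remains, so the number of bins $m$ alone sets the bias floor; increasing `lam_B` adds the shrinkage term on top of it.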